ABSTRACT

Efficient marketing or awareness-raising campaigns seek to
recruit n influential individuals, where n is the campaign
budget, who are able to cover a large target audience
through their social connections. So far most of the related
literature on maximizing this network cover assumes that
the social network topology is known. Even in such a case
finding the optimal solution is NP-hard. In practice, however,
the network topology is generally unknown and needs to be
discovered on-the-fly. In this work we consider an unknown
topology where recruited individuals disclose their social
connections (a feature known as one-hop lookahead). The goal
of this work is to provide an efficient greedy online
algorithm that recruits individuals so as to maximize the
size of the target audience covered by the campaign.

We propose a new greedy online algorithm, Maximum Expected
d-Excess Degree (MEED), and provide, to the best of our
knowledge, the first detailed theoretical analysis of the
cover size of a variety of well-known network sampling
algorithms on finite networks. Our proposed algorithm greedily
maximizes the expected size of the cover. For a class of
random power law networks we show that MEED simplifies into
a straightforward procedure, which we denote MOD (Maximum
Observed Degree). We substantiate our analytical results with
extensive simulations and show that MOD significantly
outperforms all analyzed myopic algorithms. We note that
performance may be further improved if the node degree
distribution is known or can be estimated online during the
campaign.

1. INTRODUCTION 

This paper addresses the need to efficiently select n in- 
dividuals in a network such that they cover, through their 
neighbors, the largest possible fraction of the network. On- 
line social networks have generated much attention as a 
breeding ground for new forms of social studies, social
mobilization, and online campaigns. Recruiting individuals from
a population (for instance, recruiting volunteers to get their
friends to vote in an election) is no easy task. The
recruitment of each individual comes at a cost in time, money,
and social capital, and the total budget is often small with
respect to the total population. Moreover, recruitment is
frequently targeted towards a subpopulation (say, individuals
that will likely vote for a given candidate) that may be a
relatively small fraction of the whole population. Most works
on network cover, e.g. [12, 16, 29], either consider the
social network topology to be known in advance or assume the
capability of a two-hop lookahead (where the identities of
all nodes within a two-hop neighborhood of a recruited node
are known), which is often not the case in the wild.

In this work we look at the cover problem when the net- 
work topology is unknown. Following previous literature, we 
assume that any individual in the network can be recruited 
but in our case recruitments mostly happen through friends 
recruiting friends. This link-tracing technique has been long 
used by social scientists to sample hard-to-reach
subpopulations [13]. The homophily often present in social
networks (the tendency for similar individuals to be friends
[30]) enables the likely effective recruitment of individuals
that
are either in the target subpopulation or know many un- 
recruited individuals in the target subpopulation. This is 
achieved simply by asking each recruited individual to refer 
other target individuals. 

The recent 2012 U.S. presidential election presents a real- 
life example of an application of link-tracing recruitments to 
maximize the network cover of a target subpopulation. A 
candidate's Facebook app asked its subscribers to send get- 
out-to-vote reminders to their like-minded friends in swing 
states [19]. Thus, the effectiveness of a subscriber is
measured by the number of its friends that live in swing
states. Moreover, these messages also raised awareness of the
app itself, allowing it to spread through the target
subpopulation of interest (see also Bond et al. [7] for a
description of a get-out-to-vote Facebook app experiment in
the 2010 U.S. elections).

Problem Formulation 

We formulate the target subpopulation cover problem as a
maximum connected cover (MCC) problem on an unknown
connected graph $G = (V, E)$ (we also refer to G as a
network), where V is the set of target individuals and E the
set of individuals' mutual connections. We assume all graph
parameters are unknown. Our analysis can be easily extended
to a disconnected network by considering each connected
component separately.

Our main goal is to design efficient online greedy algo- 
rithms to solve the following problem on G: let n be a given 
campaign budget; we want to determine a group of n indi- 
viduals to be recruited in order to maximize the size of the 
covered subset, i.e. of the set including the recruited nodes 
and their neighbors. Our only initially available information
is a single node sampled from the population. Later we can 
acquire the neighborhood of any recruited node. It follows 
that the recruited nodes form a connected subgraph. 

More formally, let $\mathcal{B}(t)$ be the set of recruited target
individuals after t recruitments (also denoted the t-th step).
Let $\mathcal{B}(0)$ be the set containing the initially known
individual.^1 Let $\mathcal{N}(\mathcal{B}(t))$ be a function that returns the
set of unrecruited neighbors of $\mathcal{B}(t) \subseteq V$; Fig. 1
illustrates $\mathcal{B}(t)$ and $\mathcal{N}(\mathcal{B}(t))$. The online algorithm
proceeds as follows: at step t, $0 < t \le n$, the algorithm
recruits node $v \in \mathcal{N}(\mathcal{B}(t-1))$ and performs the update
$\mathcal{B}(t) = \mathcal{B}(t-1) \cup \{v\}$. The ability to obtain the
identities of the neighbors of recruited nodes,
$\mathcal{N}(\mathcal{B}(t))$, $\forall t$, is known as one-hop lookahead in the
graph sampling literature. The objective of the online
algorithm is to maximize the size of the network cover set
$\overline{\mathcal{W}}(t) = \mathcal{B}(t) \cup \mathcal{N}(\mathcal{B}(t))$, for $t = 1, \dots, n$,
without having a priori access to topology information. We
refer to the problem of online covering an unknown network in
the presence of one-hop lookahead as the online myopic network
covering problem.
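
The recruitment loop just described can be sketched in a few lines of Python. This is only an illustrative skeleton, not the authors' implementation: the names `online_cover`, `adj`, and `choose` are hypothetical, the topology is passed as a dictionary purely to simulate the disclosure of neighbors, and the policy callback only ever sees the recruited and observed sets, which is what one-hop lookahead permits.

```python
def online_cover(adj, start, budget, choose):
    """Run an online recruitment policy with one-hop lookahead.

    adj:    dict node -> set of neighbors (hidden topology, used only
            to simulate what a recruited node discloses)
    start:  initially known node, so that B(0) = {start}
    budget: number n of recruitments performed after B(0)
    choose: hypothetical policy callback; given (recruited, observed)
            it returns the observed node to recruit next
    Returns the cover sizes |B(t) ∪ N(B(t))| for t = 0, 1, ...
    """
    recruited = {start}
    observed = set(adj[start])            # N(B(0)), disclosed by start
    sizes = [len(recruited) + len(observed)]
    for _ in range(budget):
        if not observed:                  # the whole component is recruited
            break
        v = choose(recruited, observed)
        recruited.add(v)
        observed.discard(v)
        observed |= adj[v] - recruited    # v discloses its neighbors
        sizes.append(len(recruited) + len(observed))
    return sizes


# Toy usage on a 4-node path with an arbitrary "smallest id" policy.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
sizes = online_cover(path, 0, 2, lambda rec, obs: min(obs))
```

Every algorithm discussed later differs only in how the next node is picked from the observed set (or, for the Oracle, in how much extra information that choice is allowed to use).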

Contributions 

We make the following contributions: 

(1) We thoroughly evaluate, analytically and through
extensive simulations on social network datasets^2, the
performance of several known network sampling algorithms.
(1.1) We investigate the cover sizes of Breadth-First Search
(BFS) and Depth-First Search (DFS). We observe a consistently
large variance in the cover sizes found by BFS and find that
BFS tends to underperform in comparison to a greedy oracle
scheme that recruits at every step the node with the largest
number of uncovered neighbors. We partially blame network
homophily for the lack of performance of BFS. DFS, which at
first sight should improve upon BFS by circumventing the
above homophily problem, performs even worse. Using random
networks, we show why DFS finds a small cover after $t \ll N$
recruited nodes. (1.2) A Random Walk (RW), more precisely RW
without replacement (RWnr), where nodes revisited by the
walker are not counted towards the recruitment budget, is
shown to consistently outperform (sometimes significantly)
BFS in our simulations. (1.3) We propose a new online
algorithm inspired by the Susceptible-Infected (SI) epidemic
model but observe that RWnr is consistently more efficient
than SI.

(2) Our work is, to the best of our knowledge, the first to
provide an analytical characterization of the size of the
cover $\overline{W}(t)$ as a function of the number t of recruited
nodes, for RWnr and
the SI epidemic algorithms on finite networks. As recently 
acknowledged in p^, this was a challenging open problem. 
Moreover, we establish an interesting connection between 
cover through RWs and the coupon subset collection prob- 
lem. We validate our theoretical results through simulations. 

(3) We propose a new online algorithm (MEED, Maximum
Expected d-Excess Degree) that greedily maximizes the expected
size of the cover. For a broad class of power law networks,
MEED simplifies into a straightforward heuristic, which we
denote Maximum Observed Degree (MOD). We substantiate our
analytical results with simulations. Extensive simulations on
a variety of social network datasets show that MOD
consistently outperforms (sometimes significantly) all other
analyzed algorithms. Performance can be further improved if
the node degree distribution is known or can be estimated
online during the campaign.

^1 The analysis can be extended to consider many initially known
individuals. Note that the task of finding the initial set of nodes in
the target subpopulation is a problem on its own [1, 28].
^2 See the later sections for some limited analysis of other
"non-social" networks.

Outline 

The remainder of this work is organized as follows. Sec. 2
presents the notation and background used throughout this
work. Sec. 3 discusses optimal solutions and approximations
in connection to the minimum connected dominating set.
Sec. 4 presents the datasets used in this work and our
simulation setup. Sec. 5 provides an analysis of the
effectiveness of Breadth-First Search (BFS) and Depth-First
Search (DFS). Sec. 6 provides an in-depth analysis of the
effectiveness of two types of random walks and compares them
to BFS. Sec. 7 proposes a sampling algorithm inspired by
Susceptible-Infected (SI) epidemic models. We also provide an
analytical solution describing the cover size of SI as a
function of t. An important feature of our analysis is our
ability to model finite graphs, which is key to understanding
the effectiveness of large campaigns with respect to the size
of the target population. Sec. 8 proposes MEED, and MOD as a
simple approximation of MEED. Sec. 8 also provides theoretical
and simulation results, the latter comparing MOD against the
other algorithms. Finally, Sec. 9 summarizes our contributions
and reviews the related work.

2. NOTATION & BACKGROUND 

We consider an unknown connected network $G = (V, E)$ with
$N = |V|$ nodes, $M = |E|$ edges, and degree distribution
$\{p_k\}_{k=1,\dots,N-1}$. We assume all graph parameters are
unknown to us. Denote by $\mathcal{N}_a(v)$ the set of neighbors of
node $v \in V$, irrespective of their recruitment status, and
let $k_v = |\mathcal{N}_a(v)|$ be the degree of v. For each step
$t = 1, \dots, n$, where $n \in \{1, \dots, N-1\}$ is the campaign
budget, we classify the nodes in V into three disjoint sets.
The set $\mathcal{B}(t)$ denotes the recruited nodes at step t; these
are the black nodes in Fig. 1. Unrecruited neighbors of
recruited nodes are denoted observed nodes and form the set
$\mathcal{N}(\mathcal{B}(t)) = \bigcup_{v \in \mathcal{B}(t)} \mathcal{N}_a(v) \setminus \mathcal{B}(t)$ (gray
nodes in Fig. 1). We say a node $v \in V$ is covered at step t
if $v \in \overline{\mathcal{W}}(t)$, where
$\mathcal{W}(t) = V \setminus (\mathcal{B}(t) \cup \mathcal{N}(\mathcal{B}(t)))$ is the set of all
uncovered nodes (white nodes in Fig. 1) and
$\overline{\mathcal{W}}(t) = V \setminus \mathcal{W}(t)$ its complement. Note that at
time t we are unaware of the existence of nodes in $\mathcal{W}(t)$.




Figure 1: Network sampling evolving sets. 

The sizes of the three sets $\mathcal{B}(t)$, $\mathcal{N}(\mathcal{B}(t))$, and $\mathcal{W}(t)$
are denoted $B(t)$, $N(\mathcal{B}(t))$ and $W(t)$, respectively. Clearly
$B(t) + N(\mathcal{B}(t)) + W(t) = N$ at any step $1 \le t \le n$. Finally,
for the sake of simplicity, we allow a slight abuse of
notation, denoting by $\langle \cdot \rangle$ both the empirical mean and
the expected value. The exact interpretation of $\langle x \rangle$ will
then depend on the nature of the quantity x. We use the
convention that $\langle k \rangle$ denotes the average degree. Table 1
summarizes the notation used throughout the paper.

Variable                                   Description
N                                          no. of nodes
M                                          no. of edges
$\mathcal{N}_a(v)$                         set of neighbors of $v \in V$
$\langle x \rangle$                        average value of quantity x
$p_k$                                      fraction of nodes with degree k
$\mathcal{B}(t)$, ($B(t)$)                 set (number) of sampled nodes at step t
$\mathcal{N}(A)$, ($N(A)$)                 set (no.) of unrecruited neighbors of $A \subseteq V$
$\mathcal{W}(t)$, ($W(t)$)                 set (no.) of uncovered nodes at step t

Table 1: Notation table.

Our analysis makes extensive use of the configuration random
graph model [39]. This is defined as a uniform probability
distribution over the ensemble of graphs whose nodes have a
given degree distribution $\{p_k\}_{k=1,\dots,N-1}$. A configuration
model sample can be generated as follows. A degree $k_i$ is
attributed to each node i according to the selected degree
distribution. Each vertex i can then be thought of as having
$k_i$ stubs attached to it that are the ends of edges-to-be.
The graph sample is generated by connecting randomly selected
pairs of stubs.
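
As a rough illustration of the stub-matching procedure, the sketch below pairs stubs uniformly at random for a given degree sequence. The function name `configuration_model` is hypothetical, and the sketch makes one simplifying assumption that the text does not discuss: self-loops and repeated edges produced by the random pairing are left in place, so the sample is in general a multigraph.

```python
import random


def configuration_model(degrees, seed=None):
    """Pair stubs uniformly at random; returns a list of edges (u, v).

    Assumption: self-loops and multi-edges are kept as generated.
    """
    if sum(degrees) % 2:
        raise ValueError("the degree sum must be even")
    rng = random.Random(seed)
    # Node i contributes degrees[i] stubs (edge ends) to the pool.
    stubs = [i for i, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # Consecutive stubs in the shuffled pool are joined into edges.
    return list(zip(stubs[0::2], stubs[1::2]))


edges = configuration_model([3, 2, 2, 1], seed=1)
```

Whatever the random seed, every node ends up with exactly the number of edge ends prescribed by its degree.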

The configuration model is widely used in the complex
network literature [5, 40], also for the simplicity of its
analysis. Moreover, as we will soon see, the formulas we
derive to predict the value of $N(\mathcal{B}(t))$ under the
configuration model match the results of our simulations on
actual topologies remarkably well.

3. NETWORK COVERS, ORACLES, 
& APPROXIMATE SOLUTIONS 

The problem we study is closely related to the well-studied
Maximum Coverage [36] and Minimum Connected Dominating Set
(MCDS) problems. The maximum coverage problem can be described
in our setting as selecting at most n nodes such that the
union of the nodes they cover has maximal size. The maximum
coverage problem is NP-hard, and cannot be approximated within
$1 - 1/e + o(1)$, where e is the base of the natural logarithm.
A simple greedy algorithm that exploits submodularity,
however, is able to find a $1 - 1/e$ approximation [36]. Our
problem setting, however, requires the recruited nodes to be
connected to each other. In the connected setting, the
above-mentioned greedy algorithm is similar to a greedy
algorithm used to solve the MCDS, described as follows.

Given a graph $G = (V, E)$ with N nodes, $DS \subseteq V$ is a
dominating set if every $v \in V$ either belongs to DS or has a
neighbor in DS. Thus, if all nodes in DS are recruited as
dominators, it may be possible to reach all nodes in the
network through these dominators. The minimum dominating set
(MDS) problem is to find the set DS with the minimum
cardinality. MCDS imposes the additional restriction that the
subgraph induced by the vertices in DS has to be connected.
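
The two conditions in this definition are easy to check mechanically. The following sketch (with hypothetical helper names `is_dominating_set` and `is_connected`, and the graph given as a dictionary of neighbor sets) tests domination and the connectivity of the induced subgraph separately.

```python
def is_dominating_set(adj, ds):
    """True if every node is in ds or has at least one neighbor in ds."""
    ds = set(ds)
    return all(v in ds or adj[v] & ds for v in adj)


def is_connected(adj, nodes):
    """True if the subgraph induced by `nodes` is connected."""
    nodes = set(nodes)
    if not nodes:
        return False
    stack, seen = [next(iter(nodes))], set()
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack.extend((adj[u] & nodes) - seen)   # stay inside `nodes`
    return seen == nodes


# 5-node path 0-1-2-3-4: {1, 3} dominates but is not connected.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
```

On the path above, {1, 3} is a dominating set but not a connected one, while {1, 2, 3} satisfies both conditions, which is why the connected variant generally needs more nodes.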

Our goal is not to cover all of G; instead, we seek to cover
as much of G as possible with recruitment budget n. However,
since MCDS is very closely related to our problem, key results
and techniques from the MCDS literature can provide crucial
insights into the role of lookahead in network coverage,
especially about worst-case performance guarantees when
compared to the optimal solution. In a situation where complete
network knowledge is available^3, solving MCDS is NP-hard [12].
However, there exist well-known linear approximation-preserving
reductions (L-reductions [21]) from the SetCover problem to
MCDS [16] that yield a guaranteed approximation factor of
$O(\ln n)$.

Definition 3.1 (Observed degree). The number of 
recruited neighbors of a node. 

Definition 3.2 (d-excess degree). A node with degree k and
observed degree $d \le k$ has excess degree $k - d$.

If there is limited "lookahead", say, two-hop information
about the neighborhood of each recruited node, the natural
algorithm is to greedily recruit nodes that have the maximum
number of uncovered neighbors, i.e., the maximum excess
degree. Guha and Khuller [16] implemented this greedy algorithm
by growing a tree T in an online fashion, starting from a
single node. Initially all nodes are unrecruited (white). At
each step, a vertex $v \in T$ with the largest excess degree is
recruited (colored black) and edges are added to T between v
and all its neighbors that are not in T (these unrecruited
neighbors are colored gray). The algorithm stops when all
nodes are colored either gray or black, and the connected
dominating set (CDS) is the set of non-leaf nodes in T. They
showed that the above algorithm has a guaranteed approximation
ratio of $O(\Delta)$, where $\Delta$ is the maximum degree of the
network. We refer to the aforementioned algorithm as "Oracle"
as it requires two-hop lookahead in order to compute the
excess degree of nodes in $\mathcal{N}(\mathcal{B}(t))$, a capability often
missing in real online social networks. An example of an
implementation of Guha and Khuller's Oracle can be found in
Maiya and Berger-Wolf [29] (denoted Expansion Sampling in their
work).
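
A budgeted variant of this greedy rule can be sketched as follows. This is an illustrative reading of the Oracle, not Guha and Khuller's original code: the name `oracle_cover` is hypothetical, the loop stops after `budget` recruitments rather than when the whole graph is dominated, and ties are broken by smallest node id, which is an arbitrary assumption.

```python
def oracle_cover(adj, start, budget):
    """Greedy connected cover that always recruits the observed node
    with the most uncovered neighbors (requires two-hop lookahead)."""
    recruited = {start}
    observed = set(adj[start])
    for _ in range(budget):
        if not observed:
            break
        covered = recruited | observed
        # True excess degree: neighbors that are still uncovered (white).
        v = max(sorted(observed), key=lambda u: len(adj[u] - covered))
        recruited.add(v)
        observed.discard(v)
        observed |= adj[v] - recruited
    return recruited, recruited | observed


# Node 1 hides three uncovered neighbors, node 2 only one.
g = {0: {1, 2}, 1: {0, 3, 4, 5}, 2: {0, 6},
     3: {1}, 4: {1}, 5: {1}, 6: {2}}
recruited, cover = oracle_cover(g, 0, 1)
```

The call to `adj[u]` for an unrecruited node `u` is exactly the information that one-hop lookahead does not provide, which is what makes this scheme an oracle.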

Interestingly, Guha and Khuller also showed that the
approximation factor can be significantly improved when
three-hop lookahead is exploited in a modified greedy step:
recruit a pair of adjacent vertices (i.e. mark them black) and
compare the yield in the number of gray nodes acquired in the
neighborhood of this pair; at each step greedily select a pair
of vertices or a single vertex that maximizes this yield. This
modified greedy step surprisingly yields an approximation
ratio of $O(\ln \Delta)$ instead of $O(\Delta)$. This additional
lookahead, however, is not available in several practical
settings that we are interested in studying in this paper.

In practice, with one-hop lookahead, only observed degree
information is available at nodes in $\mathcal{B}(t)$. This results in
our MEED algorithm, which follows Guha and Khuller's Oracle
approach but uses the expected excess degree in place of the
true excess degree. MEED, however, requires the degree
distribution of the network as side information. In the
absence of degree distribution information, we show that for
some random power law networks, a natural myopic online greedy
algorithm that recruits the node with the maximum observed
degree approximates MEED; this is our MOD algorithm and also
Maiya and Berger-Wolf's SEC scheme [29]. Expected value
analysis as well as simulations in Sec. 8 show that MOD is a
good heuristic when operating on realistic social networks
such as those obeying a power law degree distribution.

^3 This could arise in "intelligence gathering" applications where
analysts have pieced together the topology of an adversary network and
now want to recruit the best (connected) set of influencers that will
cover it.
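
To make the MOD rule concrete, here is a minimal sketch under stated assumptions: the name `mod_cover` is hypothetical, ties between nodes with the same observed degree are broken by smallest node id (the text does not specify a tie-breaking rule), and the adjacency dictionary is consulted only for nodes that have already been recruited.

```python
from collections import Counter


def mod_cover(adj, start, budget):
    """Recruit, at each step, the observed node with the largest
    observed degree (number of already recruited neighbors)."""
    recruited = {start}
    # Observed node -> number of recruited nodes that listed it.
    obs_degree = Counter(adj[start])
    for _ in range(budget):
        if not obs_degree:
            break
        top = max(obs_degree.values())
        v = min(u for u, d in obs_degree.items() if d == top)
        del obs_degree[v]
        recruited.add(v)
        for u in adj[v] - recruited:      # v discloses its neighbors
            obs_degree[u] += 1
    return recruited, recruited | set(obs_degree)


g = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 4, 5},
     3: {1}, 4: {2}, 5: {2}}
recruited, cover = mod_cover(g, 0, 2)
```

In this toy graph the second recruitment goes to node 2 because two recruited nodes already point to it, whereas node 3 has been listed only once.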

We note that due to the online nature of the different
algorithms presented above, it is easy to apply the maximum
budget criterion of MCNC and stop whenever the allocated
budget n has been consumed. While the theoretical
approximation guarantees may not strictly apply (unless
$n = N$), we believe that in practice they still hold for the
networks studied in this paper.



4. DATASETS & SIMULATION SETUP 

We use the Enron email dataset as the running example 
throughout this work. The Enron email dataset contains 
data from a subpopulation of about 150 users, mostly senior 
management of Enron. This data was made public by the 
Federal Energy Regulatory Commission during its
investigation of Enron. In total there are 36,692 nodes (unique
email users) with an average degree of 10.02 and an average
clustering coefficient of 0.5. The high clustering coefficient
suggests strong homophily in this network. The email corpus
description can be found in Klimt and Yang [25]. The version
of the email graph we use can be downloaded from Stanford's
SNAP repository [37].

We also make use of other social network datasets; all of
them, except Flickr, are available online at SNAP [37]. We now
describe our datasets, making use of N, $\langle k \rangle$, and c to
denote the number of nodes, the average degree, and the
clustering coefficient, respectively: Epinions (N = 75,877,
$\langle k \rangle$ = 10.7, c = 0.26) and Slashdot (N = 82,168,
$\langle k \rangle$ = 0.1, c = 0.23) online social networks, Wiki-talk
(N = 2,394,385, $\langle k \rangle$ = 3.9, c = 0.2), the Wikipedia
user-to-user discussion graph, EmailEU (N = 265,214,
$\langle k \rangle$ = 2.8, c = 0.28), the network of email communication
between users of a large European research institution,
Youtube (N = 1,134,890, $\langle k \rangle$ = 5.3, c = 0.17), the
friendship network of youtube.com users, and finally the
Flickr dataset, a snapshot of an online photo-sharing network
with N = 1,715,255 nodes and $\langle k \rangle$ = 12.2, collected in
Mislove et al. [33].

We also contrast our social network results with the results
on three non-social networks. These networks can also be
found online at SNAP [37]: Gnutella (N = 62,561,
$\langle k \rangle$ = 4.7, c = 0.01), a collection of merged P2P client
snapshots collected in Ripeanu et al. [43], HepTh (N = 27,770,
$\langle k \rangle$ = 25.4, c = 0.3), a paper citation graph, and Amazon
(N = 334,863, $\langle k \rangle$ = 5.5, c = 0.4), the network of
co-purchased products on the amazon.com website.

Simulation setup. 

Unless stated otherwise our metrics consist of averages
over 1,000 simulation runs. We use colored shadows in our
plots to show the value of the standard deviation plotted
around the average. The shadow serves two purposes. First, its
vertical width multiplied by $1.96/\sqrt{1000}$ gives
approximately the 95% confidence intervals of our averages.
Second, its value measures the variability between independent
runs, by which we compare how consistently good (or bad) an
algorithm performs. In our simulations $\mathcal{B}(0)$ includes a
single node recruited uniformly at random from V. The order in
which neighbors of a node appear on its list of neighbors is
randomized from run to run to avoid arbitrary biases that
may arise from the choice of node IDs in the dataset.
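
The relation between the plotted standard deviation and the confidence interval of the average can be illustrated as follows. The helper name `mean_with_ci` is hypothetical, and the sketch assumes the usual normal approximation with independent runs, which is what the factor 1.96 implies.

```python
import math
import statistics


def mean_with_ci(samples, z=1.96):
    """Return (mean, sample std, half-width of the ~95% CI of the mean)."""
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)            # spread between runs
    half_width = z * std / math.sqrt(len(samples))
    return mean, std, half_width


# Toy example with four runs; the simulations in the text use 1,000.
mean, std, half_width = mean_with_ci([10, 12, 14, 16])
```

With 1,000 runs the half-width is the standard deviation scaled by 1.96/sqrt(1000), roughly 0.062, so a wide shadow mostly signals run-to-run variability rather than an uncertain average.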



5. BFS & DFS ALGORITHMS 

We begin our study by comparing the performance of two
different approaches derived from two basic graph traversal
algorithms: Breadth-First Search (BFS) and Depth-First
Search (DFS). BFS is chosen because it is widely used in
network sampling [11, 26, 33, 35]. In these algorithms nodes
in $\mathcal{N}(\mathcal{B}(t))$ are recruited according to the time they were
first observed (a node is observed when one of its neighbors
is recruited). If we consider that nodes are put in a queue
when they are first observed and then removed when they are
recruited, then BFS employs a First In First Out discipline
for the queue, recruiting the first observed node in
$\mathcal{N}(\mathcal{B}(t-1))$, while DFS employs a Last In First Out
discipline, recruiting the last observed node in
$\mathcal{N}(\mathcal{B}(t-1))$. At each step a new, previously unrecruited
node is recruited, such that at step $t = N - 1$ all nodes are
recruited, i.e., $N(\mathcal{B}(N-1)) = 0$.
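
The queue view of the two traversals can be sketched as below. This is an illustrative simulation with a hypothetical function name (`traversal_cover`); unlike the simulations described in Sec. 4, neighbors are enqueued in sorted order rather than in random order, purely so that the example is reproducible.

```python
from collections import deque


def traversal_cover(adj, start, budget, discipline="fifo"):
    """Recruit observed nodes in FIFO (BFS) or LIFO (DFS) order of
    first observation; returns the cover size after each step."""
    queue = deque(sorted(adj[start]))     # observed, in observation order
    seen = {start, *adj[start]}           # recruited or observed so far
    sizes = [len(seen)]
    for _ in range(budget):
        if not queue:
            break
        v = queue.popleft() if discipline == "fifo" else queue.pop()
        for u in sorted(adj[v] - seen):   # newly observed neighbors of v
            seen.add(u)
            queue.append(u)
        sizes.append(len(seen))
    return sizes


tree = {0: {1, 2}, 1: {0, 3, 4}, 2: {0, 5}, 3: {1}, 4: {1}, 5: {2}}
bfs_sizes = traversal_cover(tree, 0, 2, "fifo")
dfs_sizes = traversal_cover(tree, 0, 2, "lifo")
```

On this small tree the LIFO discipline quickly runs into a leaf and stops gaining coverage, while the FIFO discipline keeps expanding from the earliest observed nodes.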



Fig. 2a shows the average cover size $\langle \overline{W}(t) \rangle$ of BFS and
DFS as a function of t on the Enron email network (recall
that we average over 1,000 simulation runs). We find similar
results on all of our social network datasets, see Figs. 6a-f.
The simulations show that, while both BFS and DFS achieve
full coverage for $t \approx N$, BFS significantly outperforms DFS
for all other values of t. To understand this difference, we
qualitatively analyze the step in which a given node $v \in V$
with degree $k_v$ is recruited.

Because both BFS and DFS follow edges to recruit nodes,
the probability that v is first observed in $\mathcal{N}(\mathcal{B}(t))$ at
step t is approximately $(\gamma_v/N)(1 - \gamma_v/N)^{t-1}$, where
$\gamma_v = k_v/\langle k \rangle$ (this simple formula should be a good
approximation in a configuration model where nodes are
recruited independently by both algorithms; it also assumes
$t \ll N$). Thus, large degree nodes tend to be observed earlier
in the process than small degree nodes. As a FIFO policy
recruits the earliest observed nodes from $\mathcal{N}(\mathcal{B}(t))$, BFS
tends to recruit large degree nodes first. A LIFO policy, on
the other hand, recruits the latest observed nodes from
$\mathcal{N}(\mathcal{B}(t))$, hence DFS tends to recruit small degree nodes
first. This DFS result contradicts previous results in the
literature [29], which we revisit in Sec. 9.
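
The geometric form of this approximation makes the degree bias easy to quantify. The sketch below uses hypothetical helper names and made-up example values; it simply evaluates the formula above and the mean of the corresponding geometric law, and inherits the same assumptions (configuration model, independent recruitments, t much smaller than N).

```python
def first_observed_pmf(k_v, mean_k, n_nodes, t):
    """Approximate probability that a degree-k_v node is first
    observed exactly at step t."""
    p = (k_v / mean_k) / n_nodes          # gamma_v / N
    return p * (1 - p) ** (t - 1)


def expected_first_observation(k_v, mean_k, n_nodes):
    """Mean of the geometric law above: N / gamma_v."""
    return n_nodes * mean_k / k_v


# Hypothetical network with N = 10,000 nodes and average degree 10.
hub_step = expected_first_observation(100, 10, 10_000)
leaf_step = expected_first_observation(5, 10, 10_000)
```

Under this approximation a node with ten times the average degree is expected to be observed ten times sooner than an average node, which is the mechanism behind the FIFO versus LIFO difference discussed above.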

Note that Fig. 2a shows larger standard deviations for
BFS than for DFS (although in some social networks the
relative difference may be small). This is because the cover
size of a non-negligible fraction of the BFS runs deviates from
the average. This instability is due to the strong dependence
of the BFS cover size on the initial node $\mathcal{B}(0) = \{i\}$. As
BFS explores the network in "waves" (expanding rings from i),
the initial node selection may significantly impact BFS's
cover size. Moreover, we expect BFS to perform poorly on
networks with a large degree of homophily (as seen at the end
of Sec. 6 in a simple regular lattice example). Homophily is
the tendency of individuals to connect to similar individuals
[30], thus creating patches of clustered nodes in the network.
This means that if v is connected to $u \in V$ and $z \in V$, then
u and z are more likely to be connected than random chance
would allow. In addition, if v is not connected to $w \in V$,
then u and z are more likely than random not to be connected
to w. In such a scenario it pays not to recruit both u and z
together, as their neighbors significantly overlap with higher
probability than random chance alone would allow.

DFS clearly avoids the above homophily problem by traversing
the graph in depth-first order. To increase the cover set
size we only need to modify the LIFO recruitment policy
without resorting to BFS's FIFO policy. In what follows
(Sec. 6) we explore the use of Random Walks (RWs). As we
see in the next section, a RW shares commonalities with DFS
in that it also traverses the graph from the last recruited
node (however, a RW may try to recruit a node more than
once). But, different from DFS, it allows recruitment of
observed nodes in $\mathcal{N}(\mathcal{B}(t))$ irrespective of when the node
was observed. One drawback of RWs, which fortunately can be
easily mitigated via caching, is the possibility of recruiting
already recruited nodes.

6. RW ALGORITHMS 

Now let us analyze the cover size of Random Walks (RWs)
with one-hop lookahead. Increased attention has been paid to
random walks as a tool for network sampling [13-15, 27, 42],
mostly due to their good statistical properties. In the RW
algorithm $\mathcal{N}(\mathcal{B}(t))$ is still the set of all observed
unrecruited nodes at time t. However, in a RW the node to be
recruited at step $t+1$ is a random neighbor of the node
recruited at step t, regardless of the time that the node was
observed or even if it was already recruited. We begin our
analysis assuming that a node that is recruited again at step
t needs to be paid (that is, time advances even if no new
recruitments were performed). We refer to this traditional RW
algorithm as RW with replacement (RW), in which nodes already
recruited can be recruited again. At the end of this section
we extend our analysis to the case of RW without replacement
(RWnr), where already recruited nodes are "cached" so that
recruiting nodes from $\mathcal{B}(t)$ does not count towards the
recruitment budget.
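
The difference between the two walks lies only in how a revisit is charged, which the following sketch tries to make explicit. The function name `random_walk_cover` and the `cached` flag are hypothetical; the graph is assumed to be connected with at least two nodes, and neighbors are drawn uniformly at random.

```python
import random


def random_walk_cover(adj, start, budget, cached=False, seed=None):
    """Simulate RW (cached=False) or RWnr (cached=True).

    Returns (recruited, covered) once `budget` recruitments are paid.
    """
    rng = random.Random(seed)
    current = start
    recruited = {start}
    covered = {start, *adj[start]}
    paid = 0
    while paid < budget and len(recruited) < len(adj):
        current = rng.choice(sorted(adj[current]))
        if current in recruited:
            if not cached:
                paid += 1          # RW: the repeated recruitment is paid
            continue               # RWnr: the revisit is free
        recruited.add(current)
        covered |= adj[current]    # the new recruit discloses neighbors
        paid += 1
    return recruited, covered


# 6-node path: RWnr started at node 0 must recruit 1, 2, 3 in order.
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 6} for i in range(6)}
rec_nr, cov_nr = random_walk_cover(path, 0, 3, cached=True, seed=0)
rec_rw, cov_rw = random_walk_cover(path, 0, 3, cached=False, seed=0)
```

With the same budget, the cached walk always ends with `budget` new recruits, while the plain walk may spend part of its budget on nodes it already holds.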

The cover size of RWs with one-hop lookahead has been
the subject of previous work [31]. However, we feel that one
needs to exercise caution when interpreting the results in
Mihail et al. [31]. Mihail et al. show that a RW with one-hop
lookahead finds the majority of nodes in sublinear time in
an infinite configuration model with heavy-tailed power law
degree distribution. As our approach demonstrates below,
covering finite networks is patently different from covering
infinite networks. In particular, we show that for any given
finite-sized network, the discovery rate is never superlinear
(this linear growth rate, however, can be large). Our model
also allows us to predict with high accuracy the expected
number of covered nodes as a function of $0 < t \le n$.

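Let us first analyze the performance of RW. In RW, the
expected cover size at step t is

$$\langle \overline{W}(t) \rangle = N - \sum_{v \in V} P[\text{node } v \text{ is uncovered}] = N - \sum_{v \in V} q \left( {}_{\mathcal{N}_a(v)}P \right)^{t} \mathbf{1}, \qquad (1)$$

where q is a vector with the initial distribution of the
random walk, $\mathbf{1}$ is a column vector of ones, and
${}_{\mathcal{N}_a(v)}P$ is a taboo transition probability matrix defined by

$${}_{\mathcal{N}_a(v)}P_{ij} = \begin{cases} P_{ij}, & \text{if } i, j \notin \mathcal{N}_a(v), \\ 0, & \text{otherwise,} \end{cases}$$

with $P_{ij}$ the transition probabilities of the random walk and
$\mathcal{N}_a(v)$ denoting here the neighborhood set of node v,
including node v itself.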

The above formula (1) requires complete topology knowledge
and does not allow a simple analytical solution. However,
consider the following approximation to a RW. Nodes are
recruited with replacement in i.i.d. fashion according to
the stationary distribution of the random walk. Then, the
expected cover size at step t would be given by

$$\langle \overline{W}(t) \rangle = N - \sum_{v \in V} P[\text{node } v \text{ is uncovered}] = N - \sum_{v \in V} (1 - \alpha_v)^t, \qquad (2)$$

where

$$\alpha_v = \frac{k_v + \sum_{j \in \mathcal{N}_a(v)} k_j}{2M}.$$

The above can be interpreted as a particular case of the
coupon subset collection problem [2, 32]. At each step t,
$t = 1, \dots, N-1$, we draw a subset of "coupons", a subset of
newly observed nodes in our terminology. We can observe
a node either by sampling it directly (this corresponds to
the term $k_v/(2M)$) or by sampling one of its neighbors (this
corresponds to the term $\sum_{j \in \mathcal{N}_a(v)} k_j/(2M)$). The value
of $k_v + \sum_{j \in \mathcal{N}_a(v)} k_j$ is known as the second neighbor
degree. In Appendix A we use matrix perturbation theory to
show that (2) approximates (1) well for fast mixing RWs (see
[42] for fast mixing RW techniques).
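
The i.i.d. approximation in (2) only needs node degrees and neighbor lists, so it can be evaluated directly. The sketch below is a plain transcription of that formula with a hypothetical function name (`expected_cover_iid`); `adj` maps each node to its set of neighbors, excluding the node itself.

```python
def expected_cover_iid(adj, t):
    """Evaluate N - sum_v (1 - alpha_v)^t, with alpha_v the second
    neighbor degree of v divided by 2M."""
    two_m = sum(len(nbrs) for nbrs in adj.values())   # 2M = sum of degrees
    uncovered = 0.0
    for v, nbrs in adj.items():
        second_degree = len(nbrs) + sum(len(adj[j]) for j in nbrs)
        alpha = second_degree / two_m
        uncovered += (1 - alpha) ** t
    return len(adj) - uncovered


# Star with center 0 and three leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
cover_one_draw = expected_cover_iid(star, 1)
```

For the star, one degree-proportional draw picks the center with probability 1/2 (covering all 4 nodes) and a leaf otherwise (covering 2 nodes), so the expected cover after one draw is 3, which the formula reproduces.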




Figure 3: (Enron email) Theoretical RW cover (red line) 
against simulations (blue circles) on Enron email network. Plots 
in semi-log scale. 

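Applying the Taylor series expansion and the fact that
$2M = \langle k \rangle N$, we can write, for small t, the following
approximation:

$$\langle \overline{W}(t) \rangle \approx \frac{t}{\langle k \rangle N} \sum_{v \in V} \Big( k_v + \sum_{j \in \mathcal{N}_a(v)} k_j \Big).$$

Next, we note that
$\sum_{v \in V} \sum_{j \in \mathcal{N}_a(v)} k_j = \sum_{v \in V} k_v^2 = N \langle k^2 \rangle$,
which yields

$$\langle \overline{W}(t) \rangle \approx t \, \frac{\langle k^2 \rangle + \langle k \rangle}{\langle k \rangle},$$

where $\langle k^2 \rangle$ is the empirical second moment. Thus, the cover
process has an approximately linear and large growth rate in
the initial phase if the second moment $\langle k^2 \rangle$ is large with
respect to the average $\langle k \rangle$. Next, let us treat (2) as a
continuous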
function of $t > 0$. The second derivative with respect to t is

$$\frac{d^2}{dt^2} \langle \overline{W}(t) \rangle = - \sum_{v \in V} (1 - \alpha_v)^t \log^2(1 - \alpha_v) < 0,$$

so $\langle \overline{W}(t) \rangle$ is a concave function of t and, consequently,
for any fixed network, the growth rate of the cover of RW with
one-hop lookahead is at most linear. Figure 3 shows that our
theoretical result in (2) accurately reflects the cover in
simulations over the Enron email dataset.
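
The initial growth rate derived above depends only on the first two moments of the degree sequence, as the following small sketch illustrates. The helper name `initial_growth_rate` and the example degree sequences are hypothetical.

```python
def initial_growth_rate(degrees):
    """Approximate number of nodes covered per recruitment for small t:
    (<k^2> + <k>) / <k>."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    return (mean_k2 + mean_k) / mean_k


regular_rate = initial_growth_rate([4] * 10)     # 4-regular graph
star_rate = initial_growth_rate([3, 1, 1, 1])    # star with three leaves
```

For a 4-regular graph the rate is 5, i.e. the recruited node plus its four neighbors, whereas a heavy-tailed degree sequence inflates the second moment and therefore the initial slope. Concavity then guarantees that later steps cannot gain more than this initial rate.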

Next we analyze the RWnr algorithm. Here, when the
walker comes across an already recruited node, it does not
recruit it again but rather randomly proceeds to one of its
already cached neighbors. We define the algorithm evolution
such that at each step we recruit a new node (that is, "virtual
re-recruitments" do not advance t). The analysis of RWnr
closely follows the derivations for RW, with the only
difference that recruitments can no longer be "wasted" on
already recruited nodes. The key observation here is that we
need to keep track of the edges belonging to recruited nodes.
We suggest the following approximation. Let Z(t) be the number
of edges not yet discovered at time t, then we approximate
(details can be found in Appendix B)



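$$\langle \overline{W}(t) \rangle \approx N - \sum_{v \in V} \prod_{i=0}^{t-1} \left( 1 - \frac{k_v + \sum_{j \in \mathcal{N}_a(v)} k_j}{2 \langle Z(i) \rangle} \right), \qquad (3)$$

where

$$\langle Z(t) \rangle = \sum_{(u,v) \in E} \prod_{i=0}^{t-1} \left( 1 - \frac{k_u + k_v}{2 \langle Z(i) \rangle} \right).$$

From (3) we note that RWnr also privileges nodes with large
second neighbor degrees (just like RW does).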

The plots in Fig. 2a show that RWnr significantly
outperforms RW for large values of t. Note that for small
values of t, the graphs in Fig. 2a show that the performance
of RW is similar to the performance of RWnr. This behavior is
observed because for small values of t,
$2 \langle Z(t) \rangle \approx \langle k \rangle N$, and thus the values of
$\langle \overline{W}(t) \rangle$ in eqs. (2) and (3) are similar.

The plots in Fig. 2a also show that RWnr significantly
outperforms BFS. These results are consistent on all other
datasets (refer to the plots in Figs. 6a-f). This curious fact
can be explained by the observation that BFS is affected by
the graph homophily.

At first sight, however, it seems that both RW and RWnr
should be impacted by homophily, as both have a tendency
to "over-explore" a neighborhood, just like BFS. However, as
we see next, this depends on the network and, in social
networks, we believe that this is not likely the case. We
contrast the performance of RW against BFS on two regular
lattices. We choose these graphs because on regular lattices
$\langle k^2 \rangle / \langle k \rangle$ is small, leaving just homophily (for BFS)
and the RW escape probability (the probability that a RW never
revisits the same node or neighborhood again in an infinite
network) as the primary factors in determining cover size for
BFS and RW. Fig. 4 shows the performance of RW and BFS on
regular 2D and 3D lattices. The plots show the cover size
$\overline{W}(t)$
as a function of t where both lattices have approximately 
= 10^ nodes each. Observe that RW suffers from its 
tendency to return to the same nodes. On an infinite 2D 
regular lattice a RW returns to the same node with prob- 
ability one while in a regular 3D lattice this probability is 
0.34 [47]. On the other hand, BFS is affected by the
clustering of the lattice, such that for most recruited nodes
BFS on average covers approximately only one new node per step.
A detailed analysis of these approximations can be found in
Appendix C.

7. SI ALGORITHM 

In this section we consider a different method, inspired by
the Susceptible-Infected (SI) model in epidemiology: at
step $t$ recruit a node from $\mathcal{N}(B(t))$ by randomly
selecting one of the edges between $B(t)$ and $\mathcal{N}(B(t))$.
Under SI, a node $v \in \mathcal{N}(B(t))$ with $d(v,t)$ neighbors
in $B(t)$ is recruited at time $t$ with probability
$d(v,t) / \sum_{u \in \mathcal{N}(B(t))} d(u,t)$. Our analysis
of RW-based cover size relied on the fact that a node is
always recruited with probability proportional to the degree
of that node. This is no longer the case for SI-based cover
sizes. Moreover, unlike the SI-related epidemic literature on
infinite graphs [5], we will observe from our analysis that the
probability that a particular node is recruited depends on $t$,
the number of steps that have been executed.
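
The SI recruitment rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the adjacency-dict representation, the toy graph, and the function names `si_recruit_probabilities` and `si_step` are hypothetical and do not come from the paper. The first function computes $d(v,t)/\sum_u d(u,t)$ for every frontier node; the second draws a frontier edge uniformly at random, which recruits each frontier node with exactly that probability.

```python
import random


def si_recruit_probabilities(adj, recruited):
    """Return {v: d(v,t) / sum_u d(u,t)} for every v in N(B(t))."""
    observed = {}  # d(v, t): number of recruited neighbors of frontier node v
    for u in recruited:
        for v in adj[u]:
            if v not in recruited:
                observed[v] = observed.get(v, 0) + 1
    total = sum(observed.values())  # number of edges between B(t) and N(B(t))
    if total == 0:
        return {}
    return {v: d / total for v, d in observed.items()}


def si_step(adj, recruited, rng):
    """Recruit one node by picking a frontier edge uniformly at random."""
    frontier_edges = [(u, v) for u in sorted(recruited) for v in adj[u]
                      if v not in recruited]
    _, v = rng.choice(frontier_edges)
    return v


# Toy undirected graph (hypothetical example data).
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2, 4], 4: [3]}
recruited = {0, 1}
probs = si_recruit_probabilities(adj, recruited)
next_node = si_step(adj, recruited, random.Random(0))
```

In this toy graph node 2 has two recruited neighbors and node 3 has one, so node 2 is twice as likely to be recruited next.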

Before we delve into the analysis of SI, we first show that
we cannot ignore the impact of $t$ on the degree distribution
of nodes in $\bar{B}(t)$, which, as we see next, becomes
significantly different from the degree distribution across
the whole network $\{p_k\}_{k=1,\dots}$ as $t$ gets larger.

7.1 The effect of t on the degree
distribution of $\bar{B}(t)$

In order to model the evolution of $B(t)$ and $\mathcal{N}(B(t))$
we need to understand the impact of the SI recruitment policy on
the degree distribution of the nodes still left to recruit,
$\bar{B}(t)$. Fig. 5 shows the empirical Complementary Cumulative
Distribution Function ($P[K > k]$), denoted CCDF in the plots,
of the degrees of the nodes in $\bar{B}(t)$ using SI over the
Enron email network for $t \in \{1000, 2000, 4000, 8000, 16000\}$,
against the CCDF of all the nodes in $V$. The empirical CCDF is
averaged over 1,000 runs. Observe that even when $t$ is reasonably
small, e.g., $t = 1{,}000$, the tail of the CCDF of $\bar{B}(t)$ is
already significantly "lighter" than the tail of the CCDF of $V$.
This is because large degree nodes are more likely to be recruited
early into $B(t)$; and as $\bar{B}(t)$ is depleted of large degree
nodes, the tail of the degree distribution of $\bar{B}(t)$ gets
"lighter". We use this property when analyzing the cover
performance of SI.

7.2 Analysis of SI cover 

We start our analysis by characterizing the evolution of
the cover size as a function of $t$. Section 7.1 shows that
the number of remaining nodes in $\bar{B}(t)$ with degree
$k = 1, \dots, N-1$ is a function of $t$ and $k$. Hence, our
analysis divides the recruited nodes in $B(t)$ into classes
corresponding to different degrees. In particular, using mean
field approximations we characterize $b_k(t)$, the fraction of
nodes of degree $k$ in $G$ that are recruited by time $t$,
$k = 1, \dots, N-1$. As the number of nodes of degree $k$ in
$\bar{B}(t)$ is given by $N p_k (1 - b_k(t))$, the degree
distribution of $\bar{B}(t)$ can be approximated by
$\{C p_k (1 - b_k(t))\}_{k=1,\dots,N-1}$, where $C$ is a
normalizing constant. We now characterize the connections
between nodes of various degrees between $B(t)$ and $\bar{B}(t)$.
In the configuration model described in Sec. 2, the probability
that a given node $u \in V$ of degree $k$ is connected to a
randomly chosen node $X \in V$ of degree $k_X = h$ is



Pkh 



-) 

2M J 







[Figure 2, panels (a) and (b); x-axis: t (steps)]

Figure 2: (Enron Network) Empirical average cover size $\langle W(t)\rangle$ as a function of $t \in \{1, \dots, N-1\}$. Fig. 2a compares Oracle, RW, RWnr,
BFS, and DFS; Fig. 2b compares Oracle, RWnr, SI, BFS, and MOD. Shadows show double the standard deviation of 1,000 simulations; x-axis
in log-scale.




[Figure 4, panel (a): 2D Lattice]

Figure 4: $\langle W(t)\rangle$ of RW against BFS on 2D and 3D lattices. Simulations average over 1,000 runs; standard deviation shown as
colored shadow. Plots in log-log scale.

Figure 5: (Enron Network) Complementary Cumulative Distribution Function (CCDF) of the degrees of the nodes in $\bar{B}(t)$, averaged over 1,000 runs, for $t \in
\{1000, 2000, 4000, 8000, 16000\}$ recruited nodes in the Enron Network (Enron has approximately $N = 36{,}000$ nodes). As SI recruits nodes
in the graph, the degree distribution of the remaining nodes in $\bar{B}(t)$ suffers a dramatic change. Plot in log-log scale.



The probability that $X \in B(t)$ given $k_X = h$ is
$P[X \in B(t) \mid k_X = h] = b_h(t)$, from which we approximate
the probability that an infected node $u$ of degree $k$ has an
infected neighbor of degree $h$ by

\[
P[X \in B(t),\, X \in \mathcal{N}(u) \mid k_X = h,\, u \in B(t),\, k_u = k] \approx b_h(t)\, p_{kh}.
\]

Note that the above approximation assumes that $P[X \in B(t)]$
does not depend on whether $u \in B(t)$ (a reasonable approximation
if $X$ is randomly chosen from a large population of $N p_h$
nodes in the graph). The probability that no degree $h$ node
connected to $u$ has been recruited at time $t$ can be
approximated by $(1 - b_h(t) p_{kh})^{N p_h}$. If we condition on
the event $u \in \bar{B}(t)$, then the probability that $u$ has at
least one recruited neighbor is
$1 - \prod_h (1 - b_h(t) p_{kh})^{N p_h}$. Unconditioning the above
using $P[u \in \bar{B}(t)] = 1 - b_k(t)$ yields $T_k(t)$, the
probability that at time $t$ a randomly chosen node of degree $k$
is in $\mathcal{N}(B(t))$, given by

\[
T_k(t) = (1 - b_k(t)) \left(1 - \prod_h (1 - b_h(t) p_{kh})^{N p_h}\right).
\]

Finally, the expected number of observed nodes at time $t + 1$,
$\langle \mathcal{N}(B(t+1))\rangle$, is approximately

\[
\langle \mathcal{N}(B(t+1))\rangle \approx \sum_k N p_k T_k(t). \tag{4}
\]



Appendix D contrasts our approach against previous works, e.g.,
Pastor-Satorras and Vespignani [41].

Now we derive an equation for the dynamics of the number
of sampled nodes in $B(t)$. When we sample a new edge $(u,v)$
with $u \in B(t)$ and $v \in \mathcal{N}(B(t))$ from the frontier
between $B(t)$ and $\mathcal{N}(B(t))$, the probability that $v$
has degree $k$ is proportional to $k p_k (1 - b_k(t))$.
Normalizing this probability and dividing it by $N p_k$, the
number of nodes of degree $k$, we get the average increase in
the fraction of sampled nodes of degree $k$:

\[
b_k(t+1) \approx b_k(t) + \frac{k (1 - b_k(t))}{N \sum_h h\, p_h (1 - b_h(t))}. \tag{5}
\]

It is easy to check that

\[
\langle B(t+1)\rangle = \sum_k N p_k b_k(t+1) = \sum_k N p_k b_k(t) + 1 = \langle B(t)\rangle + 1,
\]

that is, we guarantee that at every step a new node is recruited.
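
The recursion for $b_k(t)$ can be iterated numerically. The sketch below is a minimal illustration, not the authors' code: the degree distribution `p`, the network size, and the function names are hypothetical, and the initial condition $b_k(0) = 1/N$ assumes that the first node is sampled uniformly at random from $V$. It also checks the property stated above, namely that the expected number of recruited nodes grows by exactly one per step.

```python
def si_mean_field_step(p, b, n_nodes):
    """One step of the mean field recursion for b_k(t).

    p[k] is the fraction of nodes with degree k and b[k] the fraction of
    degree-k nodes already recruited; both are dicts keyed by degree.
    """
    denom = n_nodes * sum(h * p[h] * (1.0 - b[h]) for h in p)
    return {k: b[k] + k * (1.0 - b[k]) / denom for k in p}


def expected_recruited(p, b, n_nodes):
    """<B(t)> = sum_k N p_k b_k(t)."""
    return n_nodes * sum(p[k] * b[k] for k in p)


# Hypothetical degree distribution and network size.
p = {1: 0.5, 2: 0.3, 5: 0.15, 20: 0.05}
n_nodes = 1000
b = {k: 1.0 / n_nodes for k in p}  # assumed start: one uniformly chosen node
history = [expected_recruited(p, b, n_nodes)]
for _ in range(200):
    b = si_mean_field_step(p, b, n_nodes)
    history.append(expected_recruited(p, b, n_nodes))
```

After the loop, `b[20]` is much larger than `b[1]`, which mirrors the depletion of large degree nodes discussed in Section 7.1.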

We now contrast our mean field approximations against
simulations. Fig. 7 shows $\langle \mathcal{N}(B(t))\rangle$
calculated according to (4) against the empirical value obtained
from our simulations over two datasets, Enron and Gnutella. We
plot the results in log-log scale to facilitate the comparison
with respect to the relative error. Note that our approximation
tracks the simulation results very well. Fig. 2b shows the SI
cover size against the cover size of BFS and RWnr on the
Enron email network. Note that the SI cover size is larger
than the BFS cover size but smaller than that of RWnr.
These results are consistent on all of the datasets analyzed,
see Figs. 6a-f (albeit sometimes the differences are small).
The takeaway message from Fig. 2b and all the results from
the plots in Figs. 6a-f is that there is much room for
improvement. And while RWnr clearly outperforms all other
methods in the Enron dataset, this gain all but disappears in
other datasets (Epinions, Slashdot, and Flickr). In what
follows we present an algorithm that consistently outperforms
all of the algorithms studied so far.

8. EXPECTED EXCESS DEGREE 
MAXIMIZATION ALGORITHM 

One lesson to take away from Guha and Khuller's Oracle
is that knowing which node in $\mathcal{N}(B(t))$ has the largest
excess degree is crucial to achieving a good cover. While the
excess degrees of nodes in $\mathcal{N}(B(t))$ are not available
to us, we may still be able to estimate them from the available
information. Let $d(v,t)$ be the observed degree of $v$ at step
$t$. Consider a large degree node $v$ that has not yet been
recruited. We expect $v$ to also have a large observed degree
$d(v,t)$. However, depending on the degree distribution of the
nodes in $\mathcal{N}(B(t))$, there could be a "saturation point",
where the observed degree $d(v,t)$ is so large that the excess
degree $k_v - d(v,t)$ is small. In this section we propose an
algorithm, denoted Maximum Expected Excess Degree (MEED), that
at step $t > 0$ finds a node in $\mathcal{N}(B(t))$ that has
approximately the largest expected excess degree of all nodes in
$\mathcal{N}(B(t))$. In what follows we drop $t$ from our notation
for the sake of conciseness.

Let $\langle k \mid d(v)\rangle$ denote the expected degree of a
node $v \in \mathcal{N}(B(t))$ with observed degree $d(v)$. Using
$\langle k \mid d(v)\rangle$ we propose the Maximum Expected
$d$-Excess Degree (MEED) heuristic, which chooses the next
recruited node $v^*$ at step $t$ so as to maximize the expected
excess degree of the recruited node. Another way to describe the
MEED scheduler is through a partial order of the nodes in
$\mathcal{N}(B(t))$ with respect to their expected excess degree.
For two nodes $u, v \in \mathcal{N}(B(t))$ we say $u \geq_k v$ iff
$\langle k - d(u) \mid d(u)\rangle \geq \langle k - d(v) \mid d(v)\rangle$.
Thus, at step $t$ the MEED algorithm recruits node
$v^* \in \mathcal{N}(B(t))$ if $v^* \geq_k v$,
$\forall v \in \mathcal{N}(B(t))$.

Next we obtain an approximation of $\langle k - d \mid d\rangle$
using $\{p_k\}_{k=1,\dots}$. For now we assume
$\{p_k\}_{k=1,\dots}$ is given to us. Later we show that for some
important families of random networks, the node with maximum
observed degree is also the node with the maximum expected excess
degree. Let $\zeta_k(t)$ be the probability that a random node in
$\bar{B}(t)$ has degree $k$. Note that $\zeta_k(0) = p_k$ as, by
definition, the initial node in $B(0)$ is randomly sampled from
$V$. In general we have $\zeta_k(t) = C p_k (1 - b_k(t))$ (as
stated in Sec. 7), where $C$ is a normalization constant. In what
follows we omit $t$ for the sake of conciseness.

Let $\mathcal{N}^{(d)}(B(t)) \subseteq \bar{B}(t)$ denote the set
of nodes with at least $d$ recruited neighbors. Note that
$\mathcal{N}^{(1)}(B(t)) = \mathcal{N}(B(t))$ and that
$\mathcal{N}^{(0)}(B(t)) = \bar{B}(t)$. Under the configuration
model we can determine $\{\mathcal{N}^{(d)}(B(t))\}_{d=1,\dots}$
and $W(t)$ through the following process that dynamically assigns
nodes from $\bar{B}(t)$ to these sets. Let us assume $N \gg 1$ so
we do not need to worry about self-loops. Detach all nodes
$v \in V$ from their neighbors such that node $v$ with degree
$k_v$ has $k_v$ "active stubs". Iteratively connect an active stub
in $B(t)$ to a random active stub in $V$. Whenever an active stub
of a node $u \in \mathcal{N}^{(d)}(t)$, $d \in \{0, 1, \dots\}$, is
selected, we add $u$ to $\mathcal{N}^{(d+1)}(t)$ and mark both
stubs of the edge "inactive", that is, we promote $u$ to
$\mathcal{N}^{(d+1)}(t)$ but reduce its active degree by one. The
following recursion describes the degree distribution
$\{\zeta_k^{(d+1)}\}_{k=d+1,\dots}$ of the nodes in
$\mathcal{N}^{(d+1)}(t)$ in terms of the degree distribution
$\{\zeta_k^{(d)}\}_{k=d,\dots}$ of nodes in $\mathcal{N}^{(d)}(t)$



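\[
\zeta_k^{(d+1)} = \frac{(k - d)\, \zeta_k^{(d)}}{\langle k\rangle_{\zeta(d)} - d}, \qquad k > d, \tag{6}
\]

and $\zeta_k^{(d+1)} = 0$ for $k \leq d$, where
$\langle k\rangle_{\zeta(d)} = \sum_k k\, \zeta_k^{(d)}$ is the
average degree of nodes with at least $d$ recruited neighbors.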

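We now retrieve $\langle k \mid d\rangle$ from
$\langle k\rangle_{\zeta(d)}$ using the fact that
$\mathcal{N}^{(0)} \supseteq \mathcal{N}^{(1)} \supseteq \cdots$.
Let $N_d$ be the number of nodes in $\mathcal{N}(B(t))$ with
observed degree at least $d$. Note that $N_1 = |\mathcal{N}(B(t))|$.
For any two sets $A$, $A'$ such that $A' \subseteq A$, the
following holds: $\mathrm{vol}(A - A') = \mathrm{vol}(A) - \mathrm{vol}(A')$,
where $\mathrm{vol}(S)$ is the volume of the set $S$, i.e., the sum
of all the degrees of the nodes in $S$. Considering
$A = \mathcal{N}^{(d)}$ and $A' = \mathcal{N}^{(d+1)}$ it is easy
to show that

\[
\langle k \mid d\rangle = \frac{N_d \langle k\rangle_{\zeta(d)} - N_{d+1} \langle k\rangle_{\zeta(d+1)}}{N_d - N_{d+1}}.
\]

Footnote: This stub analogy is extensively used to describe
configuration models [40].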

We approximate the expectations of
$\langle N_d / (N_d - N_{d+1})\rangle$ and
$\langle N_{d+1} / (N_d - N_{d+1})\rangle$ using the observed
values of $N_{d+1}$ and $N_d$.

Our calculations of $\langle N_d / (N_d - N_{d+1})\rangle$ and
$\langle N_{d+1} / (N_d - N_{d+1})\rangle$ should be used with
caution as they do not consider the extra density of connections
inside $B(t)$ created by the MEED recruitment process. Taking this
bias into account is not trivial and is the subject of future
work. However, in our MEED simulations we observe that
$N_{d+1} \ll N_d$ for large $d$, and, in such a scenario, it is
reasonable to make the following simplification:
$\langle k - d \mid d\rangle \approx \langle k - d\rangle_{\zeta(d)}$.

Unfortunately, obtaining $\langle k - d\rangle_{\zeta(d)}$ still
requires knowing $\zeta_k^{(d)}$, $\forall k, d$, which in turn
requires knowing the degree distribution. Note, however, that MEED
simplifies to an algorithm that always selects the node $v^*$ with
the maximum observed degree in $\mathcal{N}(B(t))$ if
$\langle k - d(v^*) \mid d(v^*)\rangle \geq \langle k - d \mid d\rangle$,
$d = 1, \dots, d(v^*)$. We denote this simplified MEED heuristic
Maximum Observed Degree (MOD).
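
A minimal sketch of the MOD heuristic with one-hop lookahead follows. It is an illustration under assumptions rather than the authors' implementation: the adjacency-dict graph, the function name `mod_recruit`, and the tie-breaking rule (smallest node id) are hypothetical choices, since the text does not specify how ties are resolved. Only the neighbor lists of already recruited nodes are read, which matches the information available in this setting.

```python
def mod_recruit(adj, seed, budget):
    """Recruit up to `budget` nodes, always picking the frontier node with
    the largest observed degree (number of already recruited neighbors)."""
    recruited = [seed]
    recruited_set = {seed}
    observed = {}  # observed degree d(v) of every node in N(B(t))
    for v in adj[seed]:
        observed[v] = observed.get(v, 0) + 1
    while len(recruited) < budget and observed:
        # Ties are broken by smallest node id (an arbitrary, assumed rule).
        best = min(observed, key=lambda v: (-observed[v], v))
        del observed[best]
        recruited.append(best)
        recruited_set.add(best)
        for w in adj[best]:  # one-hop lookahead: `best` discloses neighbors
            if w not in recruited_set:
                observed[w] = observed.get(w, 0) + 1
    cover = recruited_set | set(observed)  # W(t) = B(t) plus its neighbors
    return recruited, cover


# Toy undirected graph (hypothetical example data).
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2, 4, 5], 4: [3], 5: [3]}
recruited, cover = mod_recruit(adj, seed=0, budget=3)
```

Keeping the observed degrees in a dictionary makes each recruitment linear in the frontier size; a priority queue would be the natural replacement for large graphs.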

We now examine whether we should expect to find the property
$\langle k - d(v^*) \mid d(v^*)\rangle \geq \langle k - d \mid d\rangle$,
$d = 1, \dots, d(v^*)$, in real social networks. In what follows
we show that, under certain conditions, two of the most relevant
social network models, power law and Erdos-Renyi random networks,
have this desired property. We later complement these findings
with simulation results in the Enron network. In Appendix E we
show that



\[
\langle k - d\rangle_{\zeta(d)} = \left. \frac{d^{d+1} H(z) / dz^{d+1}}{d^{d} H(z) / dz^{d}} \right|_{z=1}, \qquad d \geq 1, \tag{7}
\]

where $H(z) = \sum_k z^k \zeta_k$ is the probability generating
function (p.g.f.) of $\zeta_k$, and $d^d H(z)/dz^d$ is the $d$-th
derivative of $H(z)$ with respect to $z$. Next we use (14) to
obtain approximations of $\langle k - d\rangle_{\zeta(d)}$ for
Erdos-Renyi and power law networks.

8.1 Maximizing d-excess Degrees in 
Erdos-Renyi Networks 

In an Erdos-Renyi (ER) graph $G(N, q)$ the degree distribution
is Binomial, $p_k = \binom{N-1}{k} q^k (1 - q)^{N-1-k}$,
$k = 1, \dots, N-1$. We now consider the case where $t$ is small
enough such that $\zeta_k(t) \approx p_k$. Then, the probability
generating function of $\zeta_k$ is $H(z) = (1 - q + qz)^{N-1}$. In
such a scenario, applying (14) yields

\[
\langle k - d\rangle_{\zeta(d)} = (N - 1 - d)\, q.
\]

As $N \gg 1$ we observe that the average $d$-excess degree
can be approximated by the average degree of the network,
$\langle k - d\rangle_{\zeta(d)} \approx \langle k\rangle$,
independent of $d$. Thus, if $N_{d+1} \ll N_d$, $\forall d$, the
scheduler can be degree agnostic and, thus, MOD approximates MEED.
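
As a numerical sanity check, the recursion for $\zeta_k^{(d)}$ can be applied directly to a Binomial degree distribution and compared with the value $(N - 1 - d)q$ that follows from $H(z) = (1 - q + qz)^{N-1}$. The sketch below is illustrative only: the function name, $N = 200$, and $q = 0.05$ are hypothetical, and the distribution is taken over $k = 0, \dots, N-1$ (the $k = 0$ term does not affect the result for $d \geq 1$).

```python
from math import comb


def excess_degree_by_recursion(zeta, d_max):
    """Return [<k - d>_{zeta(d)} for d = 0..d_max] via the recursion
    zeta^{(d+1)}_k = (k - d) zeta^{(d)}_k / (<k>_{zeta(d)} - d), k > d."""
    current = dict(zeta)  # zeta^{(0)} = zeta
    result = []
    for d in range(d_max + 1):
        # <k>_{zeta(d)} - d equals sum_k (k - d) zeta^{(d)}_k because
        # zeta^{(d)} sums to one.
        mean_excess = sum((k - d) * z for k, z in current.items())
        result.append(mean_excess)
        current = {k: (k - d) * z / mean_excess
                   for k, z in current.items() if k > d}
    return result


n_nodes, q = 200, 0.05  # hypothetical ER parameters
zeta = {k: comb(n_nodes - 1, k) * q ** k * (1 - q) ** (n_nodes - 1 - k)
        for k in range(n_nodes)}
values = excess_degree_by_recursion(zeta, d_max=5)
closed_form = [(n_nodes - 1 - d) * q for d in range(6)]
```

The two lists agree up to floating point error, and both change only slightly with $d$, which is the degree-agnostic behavior described above.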

8.2 Maximizing d-excess Degrees in 
Power Law Networks 

In a power law network we cannot ignore the effect of $t$ on
$\zeta_k(t)$. Hence, we consider $\zeta_k(t)$ to be power law
distributed with an exponential cut-off. The exponential cut-off
approximates the behavior observed in Fig. 5. Moreover, in a
variety of real world networks the degree distribution can be well
approximated by power law distributions with exponential
cut-offs [38]. Let

\[
\zeta_k(t) = \frac{C_t^{k}\, k^{-\tau}}{\mathrm{Li}_\tau(C_t)}, \qquad k = 1, 2, \dots,
\]

where $C_t < 1$ is a parameter that depends on $B(t)$ and $t$,
and the normalization factor
$\mathrm{Li}_h(x) = \sum_{k=1}^{\infty} x^k / k^h$ is the $h$-th
polylogarithm function of $x$. The probability generating
function of $\zeta_k$ assumes the form
$H(z) = \mathrm{Li}_\tau(z C_t) / \mathrm{Li}_\tau(C_t)$.
In Appendix E we use $H(z)$ to show that:

• If $\tau = 1$, $\langle k - d\rangle_{\zeta(d)} = C_t d / (1 - C_t)$;
thus the node with the largest observed degree should be
recruited.

• If $\tau = 2$, $\langle k - d\rangle_{\zeta(d)} = \Gamma_d$, where
$\frac{C_t d}{(1 + 1/d)(1 - C_t)} < \Gamma_d < \frac{C_t (d + 1)}{1 - C_t}$,
which implies that
$\langle k - (d + a)\rangle_{\zeta(d+a)} > \langle k - d\rangle_{\zeta(d)}$
for $a = 2, 3, \dots$ and $d = 1, \dots$. Thus, at step $t + 1$
we should recruit the node $v \in \mathcal{N}(B(t))$ with either
the largest or second largest value $d(v)$. We believe that
these bounds can be improved to show that the node
with the largest observed degree should be recruited.
An interesting observation is that
$\langle k - d\rangle_{\zeta(d)}$ increases with $C_t$ and
diverges as $C_t \to 1$.

• With $C_t \to 1$ and $\tau > 0$: The case where $C_t \to 1$
represents the case where $\{p_k\}$ is a pure power law
distribution and $t$ has little impact on the degree distribution
of $\bar{B}(t)$. This case is only of theoretical interest as no
real world network degree distribution can match an infinite
support power law degree distribution. In this scenario
$\langle k - d\rangle_{\zeta(d)} < \infty$, $\forall d < [\tau - 1]$,
and $\langle k - [\tau]\rangle_{\zeta([\tau])} \to \infty$.
This means that the node with observed degree at least $[\tau]$
should be recruited. We believe that this result can be
strengthened to show that $\langle k - d \mid d\rangle$
monotonically increases with $d$.

Thus, we conclude that under the above networks MOD is
a good approximation to MEED.
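
The $\tau = 1$ case can be checked numerically from the p.g.f. formula, using the fact that the $d$-th derivative of $H(z)$ at $z = 1$ is the $d$-th factorial moment of $\zeta_k$. The sketch below is a hedged illustration: the function names, the cut-off parameter $C_t = 0.6$, and the truncation of the infinite support at $k = 400$ are all assumptions made for the example (the neglected tail is of order $0.6^{400}$).

```python
def pgf_derivative_at_one(zeta, d):
    """d-th derivative of H(z) = sum_k z^k zeta_k evaluated at z = 1,
    i.e. the d-th factorial moment sum_k k (k-1) ... (k-d+1) zeta_k."""
    total = 0.0
    for k, z in zeta.items():
        term = z
        for i in range(d):
            term *= (k - i)
        total += term
    return total


def expected_excess(zeta, d):
    """<k - d>_{zeta(d)} as the ratio of successive p.g.f. derivatives."""
    return pgf_derivative_at_one(zeta, d + 1) / pgf_derivative_at_one(zeta, d)


c_t, k_max = 0.6, 400  # hypothetical cut-off parameter and truncation point
weights = {k: c_t ** k / k for k in range(1, k_max + 1)}  # tau = 1
norm = sum(weights.values())  # approximately Li_1(c_t) = -log(1 - c_t)
zeta = {k: w / norm for k, w in weights.items()}
values = [expected_excess(zeta, d) for d in range(1, 5)]
closed_form = [c_t * d / (1.0 - c_t) for d in range(1, 5)]
```

Both lists grow linearly in $d$, so under this distribution the node with the largest observed degree also has the largest expected excess degree.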

8.3 Simulations, Excess Degree & MOD 

The above analysis suggests that in some power law networks
and in ER networks, $\langle k - d \mid d\rangle$ increases with
$d$. An important question is whether we observe this phenomenon
in practice. Fig. [s] shows estimates of
$\langle k - d \mid d\rangle$ for $d \in \{1, 2, 4\}$
obtained from simulating SI on the Enron network (these
estimates are averaged over 1,000 runs). We choose SI to
estimate $\langle k - d \mid d\rangle$ instead of MEED as under
MEED very few nodes with observed degrees greater than two remain
in $\mathcal{N}(B(t))$, $t > 0$, thus making the estimates
unreliable. In the figure we observe that for $t$ approximately in
the range $\{1, \dots, 2N/3\}$ we have
$\langle k - 1 \mid 1\rangle < \langle k - 2 \mid 2\rangle < \langle k - 4 \mid 4\rangle$,
and for $t$ approximately in the range $\{2N/3 + 1, \dots, N - 1\}$
we have
$\langle k - 1 \mid 1\rangle < \langle k - 2 \mid 2\rangle \approx \langle k - 4 \mid 4\rangle$.
Hence we expect that MOD is a good candidate to approximate MEED
on the Enron network.

Fig. 2b and Figs. 6a-f show that the MOD heuristic outperforms,
sometimes significantly, all previous one-hop lookahead
algorithms on all social network datasets. We, however, still
notice a significant gap between the Oracle and MOD, which
we believe can be reduced using side information to improve
the estimation of $\langle k - d\rangle_{\zeta(d)}$.



9. SUMMARY, CONCLUSIONS & 
RELATED WORK 

We have considered the problem of providing an online al- 
gorithm that, by recruiting nodes through their neighbors, 
greedily maximizes the network cover of an online social net- 
work. In our setting the network topology was unknown 
and the only topological information available came from 
the identity of the neighbors of already recruited nodes. 

In this scenario, we have evaluated the efficacy of existing
network sampling algorithms (BFS, DFS, RW) and proposed a novel
algorithm, Maximum Expected Excess Degree (MEED), inspired by
the greedy approximation to the minimum connected dominating
set of Guha and Khuller [16], which uses two-hop lookaheads
(and is thus denoted "Oracle" in this work) to recruit at every
step the node with the largest excess degree. The MEED heuristic
seeks to maximize the expected excess degree of nodes with the
help of degree distribution side information (if available). In
the absence of degree distribution information, we have shown
that on random power law and Erdos-Renyi networks MEED can
be approximated by MOD (Maximum Observed Degree), a
greedy heuristic that at every step recruits the node with the
largest observed degree. We have shown through extensive
simulations on real world social network datasets that MOD
outperforms all other algorithms, often quite significantly.

We have also provided theoretical analysis of RWs (with 
and without replacement) and of an algorithm inspired by 
the Susceptible-Infected epidemic model, which we denoted 
SI. Our theoretical analysis, to the best of our knowledge, 
stands as a contribution on its own. We expect that our 
formulas can aid practitioners in predicting the cover sizes of 
these algorithms when degree distribution side information 
is available. 

Finally, we have uncovered a puzzling, previously unknown
fact about DFS: DFS performs remarkably poorly on social
networks. In fact, DFS seems to avoid recruiting nodes with
large excess degrees. We have argued that this is due to its
tendency to keep large degree nodes at the bottom of its
recruitment queue. We note in passing that this property
of DFS may find applications in undercover military operations
where one seeks to recruit target individuals with the
minimum exposure (number of connections) to unrecruited
targets.

Related Work. 

The connections between our work and the literature on
MCDS were already presented in Sec. [S]. In this section we
review the remaining related literature. The work most
related to ours is that of Maiya and Berger-Wolf. Maiya and
Berger-Wolf present a simulation study of the cover sizes
(among other metrics) of different algorithms, including BFS,
DFS, Oracle (which they denote Expansion Sampling), and
MOD (which they denote Sample Edge Count). Their work
considers social (e.g., Enron email) and technological (e.g.,
Amazon product co-purchase) networks. Surprisingly, their
conclusions are remarkably different from ours, arguing that
DFS outperforms BFS, RWs, and, most importantly, MOD.

Maiya and Berger-Wolf show the performance of these
algorithms, in Figs. 4(e) and 4(f) of their work, on the
(non-social) HepTh and Amazon networks (see Sec. 4 for a
brief description of these datasets). In order to understand
the discrepancy between our conclusions, here we also include
simulations on these same two datasets. We show our results in
Figs. 9a and 9b. Note the remarkable difference from our social
network results in Fig. 2 and Figs. 6a-f. Indeed, DFS
experiences a great improvement in performance while MOD
worsens dramatically.

To test whether the sudden improvement of DFS and decline
of MOD can be mostly attributed to the more structured
nature of these two networks with respect to our social
networks, we artificially add randomness to these networks by
randomly rewiring all endpoints (while making sure nodes
form a single connected component). Figs. 9c and 9d provide
the results over these randomized networks. The results
are clear: adding randomness makes MOD incontestably
superior to all other algorithms (in HepTh it even matches
the performance of Oracle) and DFS is again noticeably
inferior. It remains an open question whether, with respect to
the network cover problem, most social networks are more
similar to "random networks" or more similar to "structured
networks". Our datasets and simulation results suggest the
former.

In an earlier preliminary work (Lim et al. [45]) we proposed
SI under the name Randomized Expansion Sampling (RXS), used to
find the most central nodes in a network. However, in Lim et al.
we did not analyze the SI cover size. Random walks with
lookahead have been the subject of a number of works. Cooper and
Frieze [8] studied the cover time of RW. Mihail et al. [31] show
that a RW with one-hop lookahead finds the majority of nodes in
sublinear time in an infinite configuration model with heavy
tailed power law degree distribution. Our analysis, however,
shows that the
cover rate is at most linear for any value of t. Adamic et 
al. [J proposed a RW with two-hop lookahead and analyzed 
its cover time on an infinite network. The fast cover of RWs 
has been used in the context of decentralized search, e.g.,
when searching for content on unstructured P2P networks
(see [20, 24, 28, 44, 46] and references therein).

A closely related problem is the influence maximization
problem. The influence maximization problem considers
that each recruited individual invites its neighbors, who can
be recruited with some probability. The purpose is to select
a set of "influential" individuals in order to cause a cascade
of recruitments in the network. Network topology is generally
assumed to be known [22]. This problem was first proposed
by Domingos and Richardson (please refer to the review
in Kleinberg [23] for other references). Directly related to
online recruitment, Hartline et al. [17] and Bhattacharya et
al. [6] analyze the (paid) recruitment of consumers of a
product under the assumption of perfect network knowledge.

APPENDIX 

A. RW & INDEPENDENT EDGE SAMPLING 
APPROXIMATION 

Let us assume that the random walk starts from the stationary
distribution ($\pi_v = k_v / (2M)$). This is not too restrictive
an assumption, since in the configuration model, after the first
step, the random walk is asymptotically (with respect to the
network size) in the stationary regime. In the general
case, we can add auxiliary uniform jumps, which have been
shown to significantly reduce mixing time.

We now show that if the random walk starts in steady
state, that is $q_v = \pi_v$, the expression using the complete
topological information (1) can be approximated by
expression (2). It is enough to consider just one term in (1)
corresponding to one node. Without loss of generality, we take
$v = 1$ and assume that nodes $u = 2, \dots, k_1 + 1$ are neighbors
of node 1. Partition the stationary distribution $\pi$ as
$[\pi_1\ \pi_2]$, where $\pi_1$ corresponds to node 1 and all its
neighbors and $\pi_2$ corresponds to all the other nodes. Then, we
need to evaluate

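Let us demonstrate that $\pi_2$, properly normed, is close to
$\hat{\pi}_2$, the quasi-stationary distribution of the
substochastic matrix $P_{22}$:

\[
\hat{\pi}_2 P_{22} = \lambda \hat{\pi}_2.
\]

Even though $P_{22}$ is substochastic, if the graph is large
enough, $P_{22}$ will be close enough to a stochastic matrix so
that we can apply perturbation theory. Let us consider the
following perturbation equation

\[
(\pi_2^{(0)} + \varepsilon \pi_2^{(1)} + \dots)(S_{22} - \varepsilon D)
= (1 - \varepsilon \lambda^{(1)} + \dots)(\pi_2^{(0)} + \varepsilon \pi_2^{(1)} + \dots), \tag{8}
\]

where $S_{22}$ is the stochastic complement (it describes the
transitions of the censored Markov chain) and

\[
\varepsilon D = P_{21} [I - P_{11}]^{-1} P_{12},
\]

with $\varepsilon$ as some scaling parameter. Equating terms in (8)
with $\varepsilon^0$ we obtain

\[
\pi_2^{(0)} S_{22} = \pi_2^{(0)},
\]

from which it follows that $\pi_2^{(0)} = c\, \pi_2$. Then,
equating terms in (8) with $\varepsilon^1$ we obtain

\[
\pi_2^{(1)} S_{22} - \pi_2^{(0)} D = \pi_2^{(1)} - \lambda^{(1)} \pi_2^{(0)}.
\]

Multiplying the above equation by $\mathbf{1}$ from the right
(and using $S_{22} \mathbf{1} = \mathbf{1}$) yields

\[
\lambda^{(1)} = \frac{1}{\pi_2 \mathbf{1}}\, \pi_2 D \mathbf{1},
\]

and consequently,

\[
\lambda \approx 1 - \varepsilon \lambda^{(1)} = 1 - \frac{1}{\pi_2 \mathbf{1}}\, \pi_2 P_{21} [I - P_{11}]^{-1} P_{12} \mathbf{1}.
\]

Next, from the defining equations for the stationary
distribution, we have

\[
\pi_1 P_{11} + \pi_2 P_{21} = \pi_1,
\]

which leads to

\[
\pi_2 P_{21} [I - P_{11}]^{-1} = \pi_1, \qquad
\lambda \approx 1 - \frac{1}{\pi_2 \mathbf{1}}\, \pi_1 P_{12} \mathbf{1}.
\]

First, we note that $\pi_2 \mathbf{1} \approx 1$. Second, we note
that $\pi_1 P_{12} \mathbf{1}$ is the number of links from the
neighborhood set $\mathcal{N}(1)$ to all the other nodes divided by
the total number of links. Since the number of links to the
outside of the neighborhood set is much larger than the number of
links inside the set, $\lambda \approx (1 - a_1)$ and expression
(1) can be approximated by (2).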

B. RW WITHOUT REPLACEMENT 

Here we provide a brief description of the steps used to
derive the RWnr approximations of $\langle W(t)\rangle$ and
$\langle Z(t)\rangle$.

As in the case of the analysis of RW with replacement,
we assume that we sample nodes in an i.i.d. fashion according
to the stationary distribution. Recall that $Z(t)$ denotes the
number of edges not yet discovered at time $t$. Then, given
the history $\{Z(i), i = 0, 1, \dots, t-1\}$ with $Z(0) = M$, the
probability that the node $v$ is uncovered at time $t$ is given
by

\[
P[v \notin W(t) \mid \{Z(i), i = 0, \dots, t-1\}]
= \prod_{i=0}^{t-1} \left(1 - \frac{k_v + \sum_{j \in \mathcal{N}(v)} k_j}{2 Z(i)}\right).
\]

Similarly, given the history $\{Z(i), i = 0, 1, \dots, t-1\}$, we
can calculate the probability that the link $(u, v)$ is yet
uncovered at time step $t$:

\[
P[\text{link } (u, v) \text{ is uncovered at } t \mid \{Z(i), i = 0, \dots, t-1\}]
= \prod_{i=0}^{t-1} \left(1 - \frac{k_u + k_v}{2 Z(i)}\right).
\]

Viewing the link coverage problem as a subset coupon
collector problem, we obtain

\[
\langle Z(t) \mid \{Z(i), i = 0, \dots, t-1\}\rangle
= \sum_{(u,v) \in E} \prod_{i=0}^{t-1} \left(1 - \frac{k_u + k_v}{2 Z(i)}\right).
\]

If we use the above equation recursively starting at $Z(0) = M$
and using the approximate value of $\langle Z(t)\rangle$, we obtain
the second equation in the approximate formulas for RWnr.
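
The recursive use of these formulas can be sketched as follows. This is an illustrative reading of the approximation, not the authors' code: the function name, the ring graph used as input, and the clamping of each factor at zero (to keep products non-negative when $Z$ becomes small) are assumptions added for the example. Node ids are assumed to be comparable so that each undirected edge is listed once.

```python
def rwnr_cover_approximation(adj, steps):
    """Iterate the RWnr approximation; returns (cover, z_hist) where
    cover[t] approximates <W(t)> and z_hist[t] approximates <Z(t)>."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    n = len(adj)
    # Numerators: k_v plus the degrees of v's neighbors (nodes),
    # and k_u + k_v (edges).
    node_w = {v: degree[v] + sum(degree[u] for u in adj[v]) for v in adj}
    edge_w = [degree[u] + degree[v] for u, v in edges]
    node_prod = {v: 1.0 for v in adj}  # running products over i = 0..t-1
    edge_prod = [1.0] * len(edges)
    z = float(len(edges))              # Z(0) = M
    cover, z_hist = [0.0], [z]
    for _ in range(steps):
        if z <= 0.0:
            break
        for v in adj:
            # Clamping at zero is an assumption, not part of the formulas.
            node_prod[v] *= max(0.0, 1.0 - node_w[v] / (2.0 * z))
        for e in range(len(edges)):
            edge_prod[e] *= max(0.0, 1.0 - edge_w[e] / (2.0 * z))
        z = sum(edge_prod)
        cover.append(n - sum(node_prod.values()))
        z_hist.append(z)
    return cover, z_hist


# Hypothetical input: a ring with 200 nodes (every node has degree 2).
n = 200
ring = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
cover, z_hist = rwnr_cover_approximation(ring, 50)
```

On the ring the first step gives a cover of 3 (the recruited node and its two neighbors) and removes 2 undiscovered edges, which is a quick consistency check of the numerators.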

C. BFS ON 2D AND 3D GRIDS 

We first observe that BFS discovers the graph by considering
progressively larger spheres. At step $k$ the boundary of
the set of discovered nodes is made up of all nodes $k$ hops away
from the first node. When nodes are embedded in a metric
space, the sphere has the property of being the solid with the
smallest surface for a given volume. This property intuitively
justifies why BFS discovers new nodes slowly.

We first consider an infinite 2D grid. For the 2D grid there
are $4t$ nodes at the boundary at step $t$; then every node on
the boundary contributes on average $1 + 1/t$ new nodes when it
is visited, even if each of them has degree 4 and at least 2
neighbors not yet discovered.

For a 3D grid the boundary has size $2 + 4t^2$; then every
node contributes to discovering $(1 + 2(t + 1)^2) / (1 + 2t^2)$
new nodes on average. Here a node on the boundary has 6 neighbors
and at least 3 not discovered at the beginning of the step, but
still its average contribution converges to 1 as $t$ diverges.

The reason for this slow increase is that 1) nodes at the
boundary have many connections (roughly half of them, i.e.,
$\langle k\rangle / 2$) to the interior of the sphere, i.e., to
already explored or discovered nodes, and 2) the outgoing
connections point to the same nodes. In fact almost
every node that is going to be discovered will be discovered
through $\langle k\rangle / 2$ connections (those pointing towards
the sphere). Then, as the sphere becomes larger and the border
effects negligible, every node only contributes by discovering
one new node.
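
The boundary sizes quoted above can be verified by brute force, since on a grid the number of hops from the starting node equals the L1 distance. The following sketch (the function name `shell_size` is hypothetical) counts the lattice points at distance exactly $t$ from the origin in two and three dimensions.

```python
from itertools import product


def shell_size(dim, t):
    """Number of lattice points whose hop (L1) distance from the origin
    is exactly t in a dim-dimensional grid."""
    coords = range(-t, t + 1)
    return sum(1 for point in product(coords, repeat=dim)
               if sum(abs(x) for x in point) == t)


shells_2d = [shell_size(2, t) for t in range(1, 8)]
shells_3d = [shell_size(3, t) for t in range(1, 8)]
# Average number of new nodes contributed by each boundary node in 2D.
growth_2d = [shells_2d[i + 1] / shells_2d[i] for i in range(6)]
```

The ratios in `growth_2d` equal $1 + 1/t$ and approach 1, matching the statement that each boundary node contributes about one new node.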

D. SI ALGORITHM VS. SI EPIDEMIC 

The literature on SI epidemic models is so vast and rich
that we dedicate a section of our appendix to contrasting our
results against previous works. We also provide limited
commentary on approximations of our equations. We first
note that the previous literature is often interested in a
continuous-time version of our SI model. Recall that in our
scenario $t$ is the number of recruited nodes, while the related
literature considers $t$ as time [34, 41], where at time $t$ the
average number of "infected nodes" may be smaller or larger
than $t$.

However, it is an interesting exercise to try to connect the
framework provided by Pastor-Satorras and Vespignani [41]
with our approach. Consider a continuous-time SI epidemic
on $G$. In an SI epidemic an infected node can be thought of as
playing the role of a recruited node, and susceptible
nodes play the role of non-recruited nodes. Let $\lambda$ be the
per-unit time infection rate, that is, an infected node
contaminates (recruits) a susceptible node during a time
interval $\Delta t$ with probability $\lambda \Delta t$, regardless
of the state of the infection. Let $t'$ denote the wall clock time
of the SI epidemic process and $\rho_k(t')$ be the probability
that a node with $k$ links is infected. Then [41],

\[
\frac{d\rho_k(t')}{dt'} = \lambda k (1 - \rho_k(t'))\, \Theta(\rho(t')), \qquad \forall k, \tag{9}
\]

where $\rho(t') = (\rho_1(t'), \dots, \rho_N(t'))$ and

\[
\Theta(\rho(t')) = \langle k\rangle^{-1} \sum_k k\, p_k\, \rho_k(t') \tag{10}
\]

is the probability that any given link points to an infected
node. Note that our $t$ is the number of infected nodes, and
thus,

\[
\sum_k N p_k \rho_k(t') = t. \tag{11}
\]

Analytically adding the constraint (11) into the set of
equations in (9) is not trivial, but it can be done numerically.
Let $\rho'(t)$ be the solution of (9) with the added constraint
(11). Still, even with $\rho'(t)$ it is unclear how the cover size
can be derived from (10). The main difficulty is the mapping
between wall-clock time and number of recruited nodes. Our
formulation in Section 7 solves this problem by avoiding
formulating the problem in terms of wall-clock time.

E. THE EXPECTED d-EXCESS DEGREE
FROM THE P.G.F. OF $\zeta_k$

Our analysis of {k — d\d) begins by breaking down {k — 
d)^(d) into the derivatives of the generating function of (k- 



{k-d)^,a,^J2(^-dK 



(d) 

k 



k=d 



= E 



h{h + i)ci%-» 

(fc>C(.-i) -(d-l) 



nti-2((fc>cc)-i) 
-i:H^oh{h+i)---{h+dKM 

nto((fc>c<-)-») 

1 d'^+^H{z) 



n£o(«cc)-') 



(12) 



The authors would like to acknowledge N. Perra and A. 
Baronchelli for helpful discussions on this topic. 
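where $H(z) = \sum_k z^k \zeta_k$ is the probability generating
function of $\zeta_k$. The first several equalities are a consequence of
successive applications of (6). The last equality comes from
applying the $(d+1)$-th derivative to $H(z)$. Multiplying
both sides of (12) by $\prod_{i=0}^{d-1}\bigl(\langle k \rangle_{\zeta^{(i)}} - i\bigr)$ yields

\[
\prod_{i=0}^{d}\bigl(\langle k \rangle_{\zeta^{(i)}} - i\bigr)
= \left.\frac{d^{d+1}H(z)}{dz^{d+1}}\right|_{z=1},
\tag{13}
\]

where $\langle k \rangle_{\zeta^{(0)}} = \sum_k k\, \zeta_k$. Substituting (13) into the
denominator of the r.h.s. of (12) yields

\[
\langle k - d \rangle_{\zeta^{(d)}}
= \left.\frac{d^{d+1}H(z)/dz^{d+1}}{d^{d}H(z)/dz^{d}}\right|_{z=1},
\quad d \ge 1.
\tag{14}
\]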
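
As a quick sanity check, reading (14) as the ratio of the $(d+1)$-th to the $d$-th derivative of $H(z)$ evaluated at $z = 1$, the case $d = 1$ can be written out explicitly. The moments below are taken with respect to $\zeta_k$; this worked step is an illustration added here and is not part of the original derivation.

```latex
\[
\langle k - 1 \rangle_{\zeta^{(1)}}
= \frac{H''(1)}{H'(1)}
= \frac{\sum_k k(k-1)\,\zeta_k}{\sum_k k\,\zeta_k}
= \frac{\langle k^2 \rangle}{\langle k \rangle} - 1 .
\]
```

This is the familiar mean excess degree of a node reached by following a randomly chosen link, which is what one expects for a node with a single recruited neighbor.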



An important property of the polylogarithm function is that,
for any constant $C > 0$,

\[
\frac{d\,\mathrm{Li}_\tau(Cz)}{dz} = \frac{\mathrm{Li}_{\tau-1}(Cz)}{z}.
\]

In what follows we consider three special cases: $\tau = 1$,
$\tau = 2$, and $C_t \to 1$.

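Case $\tau = 1$: Let us first consider the case $\tau = 1$, which
implies that $\zeta_k$ is very heavy tailed. We use the fact that
$\mathrm{Li}_1(a) = -\log(1 - a)$. With $\tau = 1$ we have

\[
\frac{dH(z)}{dz}
= -\frac{1}{\log(1 - C_t)}\,\frac{d\,\mathrm{Li}_1(C_t z)}{dz}
= -\frac{C_t}{\log(1 - C_t)\,(1 - C_t z)},
\]

and thus,

\[
\frac{d^{d}H(z)}{dz^{d}}
= -\frac{(d-1)!\,C_t^{d}}{\log(1 - C_t)\,(1 - C_t z)^{d}},
\]

which yields

\[
\langle k - d \rangle_{\zeta^{(d)}}
= \frac{d!\,C_t^{d+1}}{(1 - C_t)^{d+1}} \cdot \frac{(1 - C_t)^{d}}{(d-1)!\,C_t^{d}}
= \frac{C_t\, d}{1 - C_t},
\]

showing that the $d$-excess degree grows linearly with $d$ when
$\tau = 1$. Then, we conclude that recruiting the node with
the largest observed degree from $\mathcal{N}(B(t))$ maximizes the
expected cover increase.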
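
The linear growth in $d$ can be checked numerically. The sketch below assumes $\zeta_k \propto C_t^k k^{-\tau}$ for $k \ge 1$ (the form whose generating function is a ratio of polylogarithms), truncates the support at a hypothetical `k_max`, and computes the ratio of factorial moments that corresponds to the ratio of derivatives of $H(z)$ at $z = 1$. The function names are illustrative, not from the paper.

```python
def truncated_power_law(tau, C, k_max):
    """zeta_k proportional to C**k / k**tau on k = 1..k_max (assumed form)."""
    weights = {k: C**k / k**tau for k in range(1, k_max + 1)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def falling_factorial(k, n):
    """k * (k-1) * ... * (k-n+1); equals 0 whenever 0 <= k < n."""
    out = 1
    for i in range(n):
        out *= (k - i)
    return out


def d_excess_from_pgf(zeta, d):
    """Ratio of the (d+1)-th to the d-th derivative of H(z) = sum_k z^k zeta_k at z = 1."""
    num = sum(falling_factorial(k, d + 1) * w for k, w in zeta.items())
    den = sum(falling_factorial(k, d) * w for k, w in zeta.items())
    return num / den


C = 0.5
zeta = truncated_power_law(tau=1, C=C, k_max=200)
for d in range(1, 5):
    # Compare the numerical ratio with the closed form C * d / (1 - C).
    print(d, d_excess_from_pgf(zeta, d), C * d / (1 - C))
```

With `C = 0.5` the truncation error at `k_max = 200` is negligible, so the two printed columns should agree to many digits.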

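Case $\tau = 2$: Another case of interest is $\tau = 2$. In this case
note that

\[
\langle k - d \rangle_{\zeta^{(d)}}
= \left.\frac{\dfrac{d^{d+1}}{dz^{d+1}}\bigl(-\log(1 - C_t z)/z\bigr)}
             {\dfrac{d^{d}}{dz^{d}}\bigl(-\log(1 - C_t z)/z\bigr)}\right|_{z=1}.
\]

Using the Taylor series expansion

\[
-\log(1 - C_t z)/z = \sum_{h \ge 1} \frac{C_t^{h} z^{h-1}}{h}
\]

yields

\[
\langle k - d \rangle_{\zeta^{(d)}}
= \frac{\sum_{h \ge d+2} \dfrac{C_t^{h}\,(h-1)!}{h\,(h-d-2)!}}
       {\sum_{h \ge d+1} \dfrac{C_t^{h}\,(h-1)!}{h\,(h-d-1)!}}
= \frac{\sum_{m \ge 0} \dfrac{C_t^{m+1}\,(m+d+1)!}{(m+d+2)\,m!}}
       {\sum_{m \ge 0} \dfrac{C_t^{m}\,(m+d)!}{(m+d+1)\,m!}}.
\tag{15}
\]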

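We now obtain upper and lower bounds of eq. (15) for any
$d = 1, 2, \ldots$. To derive a lower bound note that

\[
\frac{m + d + 2}{(1 + 1/d)(m + d + 1)} \le 1
\]

and that $(m + d)/(m + d + 1) < 1$, for $m = 0, 1, \ldots$. Using
the above we decrease the numerator and increase the
denominator of (15), obtaining the lower bound:

\[
\langle k - d \rangle_{\zeta^{(d)}}
> \frac{\sum_{m \ge 0} \dfrac{C_t^{m+1}\,(m+d+1)!}{(1 + 1/d)(m+d+1)\,m!}}
       {\sum_{m \ge 0} \dfrac{C_t^{m}\,(m+d)!}{(m+d)\,m!}}
= \frac{C_t\, d}{(1 + 1/d)(1 - C_t)}.
\]

Similarly, an upper bound of (15) can be obtained by
applying the inequalities

\[
\frac{m + d + 1}{(1 + 1/d)(m + d)} \le 1
\]

and $(m + d + 1)/(m + d + 2) < 1$, for $m = 0, 1, \ldots$, to (15):

\[
\langle k - d \rangle_{\zeta^{(d)}}
< \frac{\sum_{m \ge 0} \dfrac{C_t^{m+1}\,(m+d+1)!}{(m+d+1)\,m!}}
       {\sum_{m \ge 0} \dfrac{C_t^{m}\,(m+d)!}{(1 + 1/d)(m+d)\,m!}}
= \frac{(d+1)\,C_t}{1 - C_t}.
\]

Thus,

\[
\frac{d}{1 + 1/d}\,\frac{C_t}{1 - C_t}
< \langle k - d \rangle_{\zeta^{(d)}}
< (d+1)\,\frac{C_t}{1 - C_t},
\quad \text{for } d = 1, 2, \ldots,
\]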

which implies that $\langle k - (d+a) \mid d+a \rangle > \langle k - d \rangle_{\zeta^{(d)}}$ for
all $a \ge 2$ and $d = 1, 2, \ldots$. This means that at step $t + 1$ we
should recruit the node $v \in \mathcal{N}(B(t))$ with either the largest
value $d(v)$ of all nodes in $\mathcal{N}(B(t))$ or the second largest
value. We hypothesize that improving the above bounds
will reveal that node $v$ should have the largest value of $d(v)$.
Thus, if the above hypothesis holds, recruiting the node with
the largest observed degree from $\mathcal{N}(B(t))$ maximizes the
expected cover increase. An interesting observation is that
$\langle k - d \rangle_{\zeta^{(d)}}$ increases with $C_t$ and diverges as $C_t \to 1$. We
now explore the case $C_t \to 1$.
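
The bounds for $\tau = 2$ can be checked numerically against the series. The sketch below evaluates the ratio of the two sums over $m \ge 0$, with numerator terms $C_t^{m+1}(m+d+1)!/((m+d+2)\,m!)$ and denominator terms $C_t^{m}(m+d)!/((m+d+1)\,m!)$, truncated after a hypothetical number of `terms`, and compares it with the lower bound $C_t d / ((1 + 1/d)(1 - C_t))$ and the upper bound $(d+1) C_t / (1 - C_t)$. Terms are computed in log space with `math.lgamma` to avoid overflowing the factorials. The function names and the truncation level are assumptions of this sketch.

```python
from math import exp, lgamma, log


def series_ratio(C, d, terms=5000):
    """Truncated ratio of the two series over m >= 0 (tau = 2 case)."""
    log_c = log(C)
    # lgamma(n + 1) == log(n!), so lgamma(m + d + 2) is log((m + d + 1)!).
    num = sum(
        exp((m + 1) * log_c + lgamma(m + d + 2) - lgamma(m + 1)) / (m + d + 2)
        for m in range(terms)
    )
    den = sum(
        exp(m * log_c + lgamma(m + d + 1) - lgamma(m + 1)) / (m + d + 1)
        for m in range(terms)
    )
    return num / den


def bounds(C, d):
    """Lower and upper bounds on the series ratio, as stated in the text."""
    lower = C * d / ((1 + 1 / d) * (1 - C))
    upper = (d + 1) * C / (1 - C)
    return lower, upper


for d in range(1, 5):
    lower, upper = bounds(0.5, d)
    print(d, lower, series_ratio(0.5, d), upper)
```

The same helper also makes the gap argument concrete: the lower bound at $d + 2$ exceeds the upper bound at $d$, which is why the bounds alone cannot separate the largest from the second largest observed degree.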

Case $C_t \to 1$: The case $C_t \to 1$ represents the case
when $\{p_k\}$ is a pure power law distribution, that is, $\zeta_k =
k^{-\tau}/\zeta(\tau)$, where $\zeta(\tau)$ is the Riemann zeta function with
parameter $\tau$ (note that $\mathrm{Li}_\tau(1) = \zeta(\tau)$), which implies that $t$ has
little impact on the degree distribution of $B(t)$. This case requires
assuming $N \to \infty$ and $t = o(N)$, and thus it is just of theoretical
interest, as no real-world network is an infinite power law
network. Because $\zeta(a) \to \infty$ for $a \le 1$ and $\zeta(a) < \infty$ for
$a > 1$, we observe that



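\[
\left.\frac{d^{\lceil \tau - 1 \rceil}H(z)}{dz^{\lceil \tau - 1 \rceil}}\right|_{z=1} \to \infty
\quad \text{and} \quad
\left.\frac{d^{\lceil \tau - 1 \rceil - 1}H(z)}{dz^{\lceil \tau - 1 \rceil - 1}}\right|_{z=1} < \infty.
\]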



Thus, $\langle k - d \mid d \rangle$ diverges for $d = \lceil \tau \rceil$ and converges for
$d < \lceil \tau \rceil$. This implies that the expected $d$-excess degree of a
node with $d = \lceil \tau \rceil$ recruited neighbors is infinite. Due to the
appearance of subtractions between two infinite quantities,
we were unable to verify whether $\langle k - d \mid d \rangle \to \infty$ for all
$d > \lceil \tau \rceil$ or, more importantly, whether or not
$\langle k - d - 1 \mid d + 1 \rangle / \langle k - d \mid d \rangle > 1$ holds for all $d > \lceil \tau \rceil$. We
hypothesize, however, that $\langle k - d \mid d \rangle \to \infty$ for all $d > \lceil \tau \rceil$. Again,
we see that recruiting the node with the largest observed
degree from $\mathcal{N}(B(t))$ maximizes the expected cover increase.

Figure 6: Empirical average cover size $\langle W(t) \rangle$ of various social networks. Comparison between Oracle, RWnr, SI, BFS, DFS,
and MOD algorithms. x-axis in log-scale.

Figure 8: (Enron Network) Average $d$-excess degree of $B(t)$ in SI as a function of step $t = 1, \ldots, N - 1$. Observe that $\langle k - 4 \mid 4 \rangle$ is
consistently larger for all values of $t$ than $\langle k - \ldots \rangle$ and $\langle k - 2 \mid 2 \rangle$. Also note that this happens notwithstanding how little the average
degree of $B(t)$ varies over $t$ (blue dots).

Figure 9: The two datasets in which DFS outperforms MOD. Comparison between the empirical average cover size $\langle W(t) \rangle$ of Oracle,
RWnr, SI, BFS, DFS, and MOD algorithms. Figs. 9a and 9b show the results on the original networks and Figs. 9c and 9d show the
results on their randomized counterparts. Note that, when randomized, we see results similar to those seen in the social networks. Thus, the good
performance of DFS and poor performance of MOD are due to the peculiar network topology of these graphs. x-axis in log-scale.